Junior Site Reliability Engineer

Lightspeed Systems, Austin, TX

About The Position

Lightspeed Systems is a global leader in education technology, providing AI-powered solutions that keep students safe, engaged, and learning. We are committed to building tools that empower schools and districts while shaping the future of digital learning. We are seeking talented individuals who are passionate about innovation, problem-solving, and meaningful impact.

We are currently looking for a Junior Site Reliability Engineer to join our team. Our infrastructure powers a global platform serving thousands of schools. You’ll join a team that ensures our systems are reliable, scalable, and forward-thinking, especially as we expand our use of automation and AI. If you’re excited about building systems, automating operations, and applying agentic AI to real-world SRE challenges, you’ll thrive here.

Please note: Sponsorship is not provided for this position.

Requirements

  • 1–2 years in a DevOps, SRE, or infrastructure role (or equivalent experience).
  • Strong problem-solving and troubleshooting mindset.
  • Excellent communication skills, curiosity, and willingness to learn in a distributed team environment.
  • Interest in applying AI-assisted development tools and automation to SRE workflows.
  • Basic programming experience (Go, Python, or JavaScript/Node.js).
  • Solid Linux administration fundamentals.
  • Experience with containers and orchestration (e.g., Docker, AWS Fargate, ECS).
  • Hands-on exposure to Terraform and IaC concepts.
  • Familiarity with AWS services and cloud infrastructure fundamentals.
  • Exposure to CI/CD workflows (GitHub, GitHub Actions, etc.).
  • Awareness of observability, monitoring, and logging practices.

Nice To Haves

  • Experience or interest in agentic AI frameworks or LLM-based automation for scripting, diagnostics, or incident management.
  • Hands-on familiarity with performance/stress-testing tools (k6, Locust, JMeter) and monitoring web applications under load.
  • Familiarity with cloud networking, security best practices, and key AWS services (API Gateway, Lambda, ECS, DynamoDB, OpenSearch, Redis, PostgreSQL).
  • AWS SysOps Administrator – Associate certification (or equivalent experience).

Responsibilities

  • Develop a deep understanding of one or more infrastructure services and their role in our platform.
  • Use Terraform to design, deploy, and maintain infrastructure as code (IaC).
  • Automate workflows and deployments with GitHub Actions, leveraging GitHub Copilot and other AI-assisted tools to improve reliability and speed.
  • Explore agentic automation (e.g., incident triage, self-healing scripts, automated runbook generation) to advance our SRE capabilities.
  • Participate in Level 1 on-call support: monitor, respond, escalate, and improve system stability.
  • Assist with performance, load, and stress testing of web applications to identify bottlenecks and durability issues.
  • Implement, maintain, and enhance observability and monitoring (e.g., via Datadog or similar).
  • Track work in Jira, communicate clearly about progress and blockers, and collaborate across teams (Platform, Product, QA, etc.).
  • Participate in incident response and post-mortems, contributing to reliability improvements.

Benefits

  • Health -- Medical, dental, and vision insurance with a healthy company contribution toward premiums. Lightspeed kicks cash into your HSA if you participate in our HDHP.
  • Wellness -- Paid parental leave. Healthy holiday and PTO policy, including Christmas to New Year’s Day break.
  • Retirement -- 401(k) matching up to 6%.
  • Other -- Work from where it makes sense. Pet insurance.