About The Position

We are seeking a Senior Site Reliability Engineer (SRE) with 7+ years of experience designing, operating, and scaling highly reliable distributed systems. In this role, you will be a key technical leader within the Cloud Platform team, responsible for ensuring the availability, performance, and resilience of mission-critical services supporting autonomy, simulation, and data-intensive workloads. You will work closely with Cloud Platform, DevOps, Data Engineering, and Autonomy teams to establish reliability standards, improve operational maturity, and build systems that scale safely under real-world conditions. The ideal candidate is deeply technical, calm under pressure, and experienced in owning reliability outcomes end-to-end.

Requirements

  • 7+ years of experience in SRE, infrastructure, or systems engineering roles.
  • Strong experience operating large-scale distributed production systems.
  • Deep understanding of Linux systems, networking, and distributed systems fundamentals.
  • Hands-on experience with Kubernetes and container orchestration.
  • Programming or scripting experience in Go, Python, or similar languages.
  • Experience designing and operating observability systems for production environments.
  • Proven ability to lead incident response and reliability improvements.
  • Strong communication skills and ability to collaborate across engineering teams.
  • Must be a U.S. citizen.
  • Must be eligible to obtain a government clearance, if required.

Nice To Haves

  • Experience supporting autonomy, robotics, simulation, or real-time systems.
  • Familiarity with AWS and large-scale cloud infrastructure.
  • Experience with chaos engineering, fault injection, or resilience testing.
  • Knowledge of CI/CD systems and progressive delivery practices.
  • Experience working in high-reliability or safety-critical environments.

Responsibilities

  • Design and evolve reliability architecture for distributed and cloud-hosted systems.
  • Define and implement SRE best practices, including SLIs, SLOs, error budgets, and capacity planning.
  • Partner with platform and application teams to design systems for reliability, scalability, and operability.
  • Identify and mitigate systemic reliability risks across infrastructure and services.
  • Lead incident response processes including on-call rotations, escalation, and post-incident reviews.
  • Conduct root cause analysis for complex production incidents and drive long-term improvements.
  • Improve operational readiness through runbooks, automation, and resilience testing.
  • Reduce operational toil through tooling, automation, and process improvements.
  • Design and maintain observability systems for metrics, logging, tracing, and alerting.
  • Ensure services and data pipelines are observable, debuggable, and performant in production.
  • Drive performance analysis and tuning across infrastructure and service layers.
  • Build automation to improve system reliability, deployment safety, and recovery processes.
  • Partner with DevOps and Cloud Platform teams on CI/CD reliability, rollout strategies, and safe deployment patterns.
  • Support and improve Kubernetes-based environments and containerized workloads.
  • Collaborate with security teams to ensure secure and resilient system design.
  • Participate in disaster recovery planning and testing.
  • Maintain strong operational practices around access control, secrets management, and change management.

Benefits

  • 100% Employer-Paid Health, Dental, and Vision Insurance for you and your family
  • Life Insurance (Employer Paid)
  • Ability to participate in the company's 401(k) program (Matching)
  • Unlimited PTO policy with an enforced 2-week minimum
  • Equity Package
  • Work / Home Office Stipend
  • Global Entry
  • 16-Week Paid Parental Leave
  • Monthly Health and Wellness Stipend
© 2024 Teal Labs, Inc