Software Engineer, Site Reliability (SRE)

Sierra, San Francisco, CA
Onsite

About The Position

At Sierra, we’re creating a platform to help businesses build better, more human customer experiences with AI. We are primarily an in-person company based in San Francisco, with growing offices in Atlanta, New York, London, France, Singapore, and Japan. We are guided by a set of values that are at the core of our actions and define our culture: Trust, Customer Obsession, Craftsmanship, Intensity, and Family. These values are the foundation of our work, and we are committed to upholding them in everything we do.

Our co-founders are Bret Taylor and Clay Bavor. Bret currently serves as Board Chair of OpenAI. Previously, he was co-CEO of Salesforce (which had acquired the company he founded, Quip) and CTO of Facebook. Bret was also one of Google's earliest product managers and co-creator of Google Maps. Before founding Sierra, Clay spent 18 years at Google, where he most recently led Google Labs. Earlier, he started and led Google’s AR/VR effort, Project Starline, and Google Lens. Before that, Clay led the product and design teams for Google Workspace.

As a Software Engineer on our Site Reliability team at Sierra, you will be responsible for defining and building the foundation of reliability, observability, and scalability across Sierra’s AI-driven infrastructure. You’ll partner closely with our core engineering and product teams to ensure our systems are highly available, efficient, and built for growth.

Requirements

  • 5+ years of hands-on experience in Site Reliability or Infrastructure engineering roles for complex SaaS or cloud-based systems.
  • Experience designing for availability, scalability, and reliability at both infrastructure and application layers.
  • Deep experience with Terraform, AWS services, container orchestration, and cloud networking (including IAM and VPC architecture).
  • Strong background in observability systems (e.g., Prometheus, Grafana, Datadog, or similar).
  • Experience working with enterprise customers, and familiarity with their compliance requirements, networking needs, and integration patterns.
  • Comfortable working in fast-moving environments and collaborating across product, ML, and core engineering teams.
  • Degree in Computer Science or a related field, or equivalent professional experience.

Nice To Haves

  • Experience with LLM infrastructure — optimizing inference performance, managing fine-tuned models, or large-scale model deployment.
  • Past experience in an early-stage startup environment, especially defining SRE culture and tooling from scratch.
  • Familiarity with incident management automation or self-healing infrastructure patterns.

Responsibilities

  • Own Sierra’s observability stack—monitoring, alerting, logging, and tracing—to give engineers clear visibility into system health and performance.
  • Partner with product and platform engineers to design systems that are reliable and scalable from day one—not as an afterthought.
  • Design and implement scalable, reliable, and secure cloud infrastructure (AWS) using Terraform and modern DevOps tooling.
  • Improve the reliability and scalability of our LLM deployments, ensuring robust, performant, and cost-effective operation.
  • Lead improvements to deployment pipelines, CI/CD tooling, and incident management processes to reduce downtime and response time.
  • Define the foundation of SRE practices at Sierra, influencing culture, tooling, and best practices across the engineering org.

Benefits

  • Flexible (Unlimited) Paid Time Off
  • Medical, Dental, and Vision benefits for you and your family
  • Life Insurance and Disability Benefits
  • Retirement Plan (e.g., 401(k), pension) with Sierra match
  • Parental Leave
  • Fertility and family building benefits through Carrot
  • Lunch, as well as delicious snacks and coffee to keep you energized
  • Discretionary Benefit Stipend giving people the ability to spend where it matters most
  • Free alphorn lessons