Software Engineer, Site Reliability (SRE)

Sierra • San Francisco, CA

About The Position

As a Software Engineer on our Site Reliability team at Sierra, you will be responsible for defining and building the foundation of reliability, observability, and scalability across Sierra’s AI-driven infrastructure. You’ll partner closely with our core engineering and product teams to ensure our systems are highly available, efficient, and built for growth.

Own Sierra’s observability stack (monitoring, alerting, logging, and tracing) to give engineers clear visibility into system health and performance.
Partner with product and platform engineers to design systems that are reliable and scalable from day one, not as an afterthought.
Design and implement scalable, reliable, and secure cloud infrastructure (AWS) using Terraform and modern DevOps tooling.
Improve the reliability and scalability of our LLM deployments, ensuring robust, performant, and cost-effective operation.
Lead improvements to deployment pipelines, CI/CD tooling, and incident management processes to reduce downtime and response time.
Define the foundation of SRE practices at Sierra, influencing culture, tooling, and best practices across the engineering org.

Requirements

5+ years of hands-on experience in Site Reliability or Infrastructure engineering roles for complex SaaS or cloud-based systems.
Experience designing for availability, scalability, and reliability at both infrastructure and application layers.
Deep experience with Terraform, AWS services, container orchestration, and cloud networking.
Strong background in observability systems (e.g., Prometheus, Grafana, Datadog, or similar).
Experience working with enterprise customers and familiarity with their compliance and networking needs.
Comfortable working in fast-moving environments and collaborating across product, ML, and core engineering teams.
Degree in Computer Science or a related field, or equivalent professional experience.

Nice To Haves

Experience with LLM infrastructure — optimizing inference performance, managing fine-tuned models, or large-scale model deployment.
Past experience in an early-stage startup environment, especially defining SRE culture and tooling from scratch.
Familiarity with incident management automation or self-healing infrastructure patterns.

Responsibilities

Define and build the foundation of reliability, observability, and scalability across Sierra’s AI-driven infrastructure.
Own Sierra’s observability stack—monitoring, alerting, logging, and tracing.
Partner with product and platform engineers to design reliable and scalable systems.
Design and implement scalable, reliable, and secure cloud infrastructure (AWS) using Terraform.
Improve the reliability and scalability of LLM deployments.
Lead improvements to deployment pipelines, CI/CD tooling, and incident management processes.
Define the foundation of SRE practices at Sierra.

Benefits

Flexible (Unlimited) Paid Time Off
Medical, Dental, and Vision benefits for you and your family
Life Insurance and Disability Benefits
Retirement Plan (e.g., 401K, pension) with Sierra match
Parental Leave
Fertility and family building benefits through Carrot
Lunch, as well as delicious snacks and coffee to keep you energized
Discretionary Benefit Stipend giving people the ability to spend where it matters most
Free alphorn lessons

What This Job Offers

Job Type

Full-time

Career Level

Mid Level

Education Level

Bachelor's degree

Number of Employees

251-500 employees
