Senior Site Reliability Engineer

David AI • San Francisco, CA

1d • $160,000 - $220,000 • Onsite

About The Position

As a Senior Site Reliability Engineer at David AI, you will shape and build the foundation for reliability, observability, and scalability across David AI's infrastructure. Working closely with our engineering and product teams, you’ll help ensure our systems are resilient, efficient, and designed to scale as the company grows.

Requirements

5+ years of experience in Site Reliability, Infrastructure, or Platform Engineering supporting large-scale SaaS or cloud systems.
Hands-on experience applying security best practices in production systems and cloud infrastructure.
Strong experience building and running reliable, highly available, and scalable systems.
Hands-on experience with AWS, Terraform, containers and container orchestration (such as Kubernetes), and cloud networking fundamentals.
Experience implementing and maintaining observability tooling across monitoring, logging, alerting, and tracing (e.g., Prometheus, Grafana, Datadog, or similar).
Comfortable working in fast-paced teams and collaborating closely with product, ML, and engineering teams.
Bachelor’s degree in Computer Science or related field, or equivalent practical experience.

Nice To Haves

Past experience in an early-stage startup environment, especially defining SRE culture and tooling from scratch.
Familiarity with incident management automation or self-healing infrastructure patterns.

Responsibilities

Own David AI’s observability stack, including monitoring, alerting, logging, and tracing, to provide engineers with clear visibility into system health, reliability, and performance.
Partner closely with product and platform engineering teams to design systems that are scalable, resilient, and reliable from day one, not as an afterthought.
Design and implement secure, scalable cloud infrastructure across AWS using Terraform and modern DevOps tooling to support rapid product and research iteration.
Lead improvements across deployment pipelines, CI/CD systems, and incident response processes to reduce downtime, improve operational efficiency, and increase engineering velocity.
Define and evolve the foundation of SRE practices at David AI, influencing reliability culture, tooling standards, operational excellence, and best practices across the engineering organization.

Benefits

Unlimited PTO.
Top-notch health, dental, and vision insurance, with 100% coverage for most plans.
FSA & HSA access.
401k access.
Two meals daily through DoorDash, plus snacks and beverages available at the office.
Unlimited company-sponsored Barry’s classes.
