Senior Site Reliability Engineer

Drata • San Francisco, CA

59d • Hybrid

About The Position

Drata's SRE team operates as both a central engineering function and an embedded reliability practice. You'll be part of a close-knit SRE team where you can grow your career, shape standards, and collaborate with peers, while also serving as the dedicated reliability partner for one of Drata's product engineering teams across the full lifecycle of their work.

This is a highly technical role at the intersection of software engineering and systems engineering. The best SREs at Drata are engineers first: they solve problems by building solutions, not by executing manual processes. Automation is a core value, and nowhere is that more visible than in how we approach reliability.

Our infrastructure runs on AWS across multiple accounts and is defined entirely in Terraform. You'll work across a modern cloud-native stack to help Drata scale reliably for a rapidly growing customer base.

Requirements

6+ years of experience in Site Reliability Engineering, Cloud Engineering, or building and maintaining scalable, resilient services
Strong working knowledge of core cloud engineering tools: Terraform, Docker, Git, and Linux
Hands-on experience with Datadog for monitoring, alerting, dashboards, SLO tracking, and distributed tracing
Experience building software systems in a software engineering role
Experience developing tooling and automation in Python and/or Bash
Experience with CI/CD pipeline automation, specifically GitHub Actions
Experience with disaster recovery practices and incident management
Strong understanding of observability concepts - monitoring, logging, distributed tracing, and metrics - and how to apply them to production systems
Experience with container orchestration and deployment technologies including AWS ECS Fargate and/or Kubernetes
Experience working with relational databases (MySQL proficiency is a plus)
Ability to take ownership of problems and act on them independently in a constantly evolving environment
Hands-on experience using AI-assisted development tools (e.g., GitHub Copilot, Cursor, or similar) to accelerate automation, scripting, or infrastructure work
Demonstrated use of AI/AIOps capabilities for reliability tasks - anomaly detection, incident triage, runbook generation, or alert noise reduction
Familiarity with the operational characteristics of AI/ML-backed services and what it means to make them observable and reliable in production
Demonstrated passion for AI through personal projects, contributions, or continuous learning in the context of infrastructure or reliability engineering

Nice To Haves

Experience with AIOps - using AI/ML-based tooling for anomaly detection, predictive alerting, or automated incident triage
Familiarity with the reliability characteristics of AI/ML-backed services (e.g., LLM inference latency, non-determinism, prompt pipeline observability)
Experience with the JavaScript/Node.js ecosystem
Certified Kubernetes Administrator (CKA) certification
Familiarity with compliance frameworks like SOC 2, ISO 27001, or NIST

Responsibilities

Reliability Architecture for Your Product Team: You are the reliability expert for your aligned product team. You engage early - during architecture reviews and design discussions - to surface risks before they become incidents.
Lead Production Readiness Reviews (PRRs) before new services launch, with the authority to flag gaps and gate launches when critical reliability standards aren't met
Partner with product engineering leads and staff engineers to define SLOs and SLIs for critical services, turning reliability from a vague goal into a measurable commitment
Participate in team planning and architecture reviews to provide proactive reliability guidance
Build reusable artifacts - SLO templates, observability checklists, alerting standards, reference dashboards - that raise the reliability floor across the team, not just the services you touch directly
Eliminating Toil Through Engineering: You handle operational needs from your product team, but your job isn't to be a help desk. Your goal is to make each request the last of its kind. When an engineer needs something, your order of preference is: automate it so anyone can do it → document it so the team can self-serve → execute it manually only as a last resort.
Build and maintain Datadog monitors, dashboards, and alert routing - enforcing infrastructure-as-code standards via Terraform so those resources are owned, versioned, and auditable
Handle infrastructure requests: ECS task management, secret rotations, Terraform changes, capacity adjustments
Identify repeated manual work and convert it into self-service tooling or runbooks
Audit existing services for reliability anti-patterns and surface top risks before they cause incidents
Central SRE Platform Work: Beyond your product team, you contribute to cross-cutting infrastructure, tooling, and standards that benefit every team at Drata. Recent examples include automated Datadog governance workflows, dynamic AWS account provisioning, and disaster recovery exercises.
Design and build shared platform infrastructure - reusable Terraform modules, standardized observability stacks, service templates - so reliability improvements compound across the organization
Participate in the on-call rotation and lead incident response when needed; conduct thorough post-incident reviews to drive lasting fixes
Design and manage CI/CD pipelines using GitHub Actions
Contribute to evolving SRE standards, tooling, and practices across the organization

Benefits

Stock equity
Up to 100% employer-paid premiums for medical, dental, and vision coverage for employees and their dependents
Comprehensive wellness benefits and healthcare concierge services
401(k) plan
Company-paid life and disability insurance
Tax-advantaged spending accounts
Discounted voluntary offerings
Paid Parental Leave policy
Kindbody fertility and family-building benefits
Dedicated leave specialists
Generous annual stipends for both professional and personal development
Access to a wide range of internal learning opportunities
Flexible vacation policy
Paid holidays