Senior Site Reliability Engineer

Drata
San Francisco, CA
Hybrid

About The Position

Drata's SRE team operates as both a central engineering function and an embedded reliability practice. You'll be part of a close-knit SRE team where you grow your career, shape standards, and collaborate with peers - while also serving as the dedicated reliability partner for one of Drata's product engineering teams across the full lifecycle of their work.

This is a highly technical role at the intersection of software engineering and systems engineering. The best SREs at Drata are engineers first: they solve problems by building solutions, not by executing manual processes. Automation is a core value, and nowhere is that more visible than in how we approach reliability.

Our infrastructure runs on AWS across multiple accounts, defined entirely in Terraform. You'll work across a modern cloud-native stack to help Drata scale reliably for a rapidly growing customer base.

Requirements

  • 6+ years of experience in Site Reliability Engineering, Cloud Engineering, or building and maintaining scalable, resilient services
  • Robust knowledge of cloud computing technologies: Terraform, Docker, Git, and Linux
  • Hands-on experience with Datadog for monitoring, alerting, dashboards, SLO tracking, and distributed tracing
  • Experience building software systems as a software engineer
  • Experience developing tooling and automation in Python and/or Bash
  • Experience with CI/CD pipeline automation, specifically GitHub Actions
  • Experience with disaster recovery practices and incident management
  • Strong understanding of observability concepts - monitoring, logging, distributed tracing, and metrics - and how to apply them to production systems
  • Experience with container orchestration and deployment technologies including AWS ECS Fargate and/or Kubernetes
  • Experience working with relational databases (MySQL proficiency is a plus)
  • Ability to take ownership of problems and act on them independently in a constantly evolving environment
  • Hands-on experience using AI-assisted development tools (e.g., GitHub Copilot, Cursor, or similar) to accelerate automation, scripting, or infrastructure work
  • Demonstrated use of AI/AIOps capabilities for reliability tasks - anomaly detection, incident triage, runbook generation, or alert noise reduction
  • Familiarity with the operational characteristics of AI/ML-backed services and what it means to make them observable and reliable in production
  • Demonstrated passion for AI through personal projects, contributions, or continuous learning in the context of infrastructure or reliability engineering

Nice To Haves

  • Experience with AIOps - using AI/ML-based tooling for anomaly detection, predictive alerting, or automated incident triage
  • Familiarity with the reliability characteristics of AI/ML-backed services (e.g., LLM inference latency, non-determinism, prompt pipeline observability)
  • Experience with the JavaScript/Node.js ecosystem
  • Certified Kubernetes Administrator (CKA) certification
  • Familiarity with compliance frameworks like SOC 2, ISO 27001, or NIST

Responsibilities

  • Reliability Architecture for Your Product Team: You are the reliability expert for your aligned product team. You engage early - during architecture reviews and design discussions - to surface risks before they become incidents.
      • Lead Production Readiness Reviews (PRRs) before new services launch, with the authority to flag gaps and gate launches when critical reliability standards aren't met
      • Partner with product engineering leads and staff engineers to define SLOs and SLIs for critical services, turning reliability from a vague goal into a measurable commitment
      • Participate in team planning and architecture reviews to provide proactive reliability guidance
      • Build reusable artifacts - SLO templates, observability checklists, alerting standards, reference dashboards - that raise the reliability floor across the team, not just the services you touch directly
  • Eliminating Toil Through Engineering: You handle operational needs from your product team, but your job isn't to be a help desk. Your goal is to make each request the last of its kind. When an engineer needs something, your priority is: automate it so anyone can do it → document it so the team can self-serve → execute it manually only as a last resort.
      • Build and maintain Datadog monitors, dashboards, and alert routing - enforcing infrastructure-as-code standards via Terraform so those resources are owned, versioned, and auditable
      • Handle infrastructure requests: ECS task management, secret rotations, Terraform changes, capacity adjustments
      • Identify repeated manual work and convert it into self-service tooling or runbooks
      • Audit existing services for reliability anti-patterns and surface top risks before they cause incidents
  • Central SRE Platform Work: Beyond your product team, you contribute to cross-cutting infrastructure, tooling, and standards that benefit every team at Drata. Recent examples include automated Datadog governance workflows, dynamic AWS account provisioning, and disaster recovery exercises.
      • Design and build shared platform infrastructure - reusable Terraform modules, standardized observability stacks, service templates - so reliability improvements compound across the organization
      • Participate in the on-call rotation and lead incident response when needed; conduct thorough post-incident reviews to drive lasting fixes
      • Design and manage CI/CD pipelines using GitHub Actions
      • Contribute to evolving SRE standards, tooling, and practices across the organization

Benefits

  • Stock equity
  • Up to 100% employer-paid premiums for medical, dental, and vision coverage for employees and their dependents
  • Comprehensive wellness benefits and healthcare concierge services
  • 401(k) plan
  • Company-paid life and disability insurance
  • Tax-advantaged spending accounts
  • Discounted voluntary offerings
  • Paid Parental Leave policy
  • Kindbody fertility and family-building benefits
  • Dedicated leave specialists
  • Generous annual stipends for both professional and personal development
  • Access to a wide range of internal learning opportunities
  • Flexible vacation policy
  • Paid holidays