About The Position

GRAIL's mission is to detect cancer early, when it can be cured. The company is focused on changing the trajectory of cancer mortality through innovative, safe, and effective technologies. As a healthcare company, GRAIL pioneers new technologies for early cancer detection, leveraging next-generation sequencing (NGS), population-scale clinical studies, and state-of-the-art computer and data science. Headquartered in the Bay Area of California, with locations in Washington, D.C., North Carolina, and the United Kingdom, GRAIL is supported by leading global investors and pharmaceutical, technology, and healthcare companies.

GRAIL is seeking a Staff Site Reliability / DevOps Engineer to lead the reliability, scalability, and security of its cloud-native platform. The role combines infrastructure engineering, platform strategy, and organizational leadership in support of systems for large-scale data processing and cancer detection. The engineer will define and drive infrastructure standards, represent reliability and performance in architecture decisions, and build scalable systems.

This is a highly technical, high-impact role that pairs hands-on engineering with cross-functional influence and mentorship. It is a hybrid position based in either Menlo Park, CA (moving to Sunnyvale, CA in Fall 2026) or Durham, NC, and requires a minimum of 60% on-site work (24 hours per week).

Requirements

  • BS in Computer Science, Engineering, or related field, or equivalent experience
  • 8+ years of experience in Site Reliability Engineering, DevOps, or platform engineering
  • Strong hands-on experience with at least one major cloud platform (AWS, GCP, or Azure)
  • Experience implementing infrastructure-as-code solutions (Terraform, CloudFormation, or similar)
  • Experience designing and operating CI/CD pipelines (e.g., GitLab CI, GitHub Actions, Jenkins)
  • Hands-on experience with Kubernetes and containerized systems in production environments
  • Proficiency in scripting or programming for automation (e.g., Python, Go, Bash, or PowerShell)
  • Experience with observability and monitoring tools (e.g., Prometheus, Grafana, OpenTelemetry, Datadog)
  • Strong understanding of networking, security, and distributed systems fundamentals
  • Experience working in regulated environments and familiarity with frameworks such as ISO 27001, NIST, SOC 2, or HIPAA

Nice To Haves

  • 10+ years of experience in SRE, DevOps, or infrastructure engineering
  • Experience operating multi-cluster Kubernetes environments (e.g., EKS, GKE) at scale
  • Familiarity with GitOps practices (e.g., ArgoCD, Flux)
  • Experience with data platforms and pipelines (e.g., Kafka, Airflow, Spark, Snowflake, BigQuery)
  • Experience implementing SLO/SLI frameworks and reliability practices across multiple teams
  • Strong background in cloud security, including IAM, zero-trust architecture, and secrets management
  • Experience with compliance-as-code and security tooling (e.g., OPA, Snyk, Checkov)
  • Exposure to AI/ML or large-scale data infrastructure workloads
  • Experience in healthcare, biotech, or other regulated industries
  • Relevant cloud or Kubernetes certifications (e.g., AWS DevOps, CKA/CKS, GCP DevOps)

Responsibilities

  • Design, build, and operate highly available, fault-tolerant cloud infrastructure across AWS, GCP, and/or Azure
  • Architect and maintain scalable CI/CD pipelines and deployment frameworks for enterprise-grade software delivery
  • Lead infrastructure-as-code adoption and maturity using tools such as Terraform, CloudFormation, and Ansible
  • Own Kubernetes reliability across multi-cluster environments, including upgrades, scaling, and workload lifecycle management
  • Establish and evolve observability platforms (metrics, logs, traces) and define SLO/SLI frameworks across teams
  • Lead incident response for critical outages, drive root cause analysis, and implement preventative improvements
  • Optimize infrastructure for cost, performance, and scalability, partnering closely with engineering and finance stakeholders
  • Define and enforce DevOps, reliability, and security best practices across the organization
  • Partner cross-functionally with engineering, data, QA, security, and IT teams to design resilient systems
  • Mentor engineers and contribute to technical leadership through design reviews, standards, and knowledge sharing

Benefits

  • Flexible time off or vacation
  • A 401(k) retirement plan with employer match
  • Medical, dental, and vision coverage
  • Carefully selected mindfulness programs