Staff Site Reliability Engineer (SRE) | DevOps Engineer

GRAIL • Menlo Park, CA

64d•Hybrid

About The Position

Our mission is to detect cancer early, when it can be cured. We are working to change the trajectory of cancer mortality and bring stakeholders together to adopt innovative, safe, and effective technologies that can transform cancer care. We are a healthcare company pioneering new technologies to advance early cancer detection. We have built a multidisciplinary organization of scientists, engineers, and physicians, and we are using the power of next-generation sequencing (NGS), population-scale clinical studies, and state-of-the-art computer science and data science to overcome one of medicine’s greatest challenges.

GRAIL is headquartered in the Bay Area of California, with locations in Washington, D.C., North Carolina, and the United Kingdom. It is supported by leading global investors and pharmaceutical, technology, and healthcare companies.

GRAIL is seeking a Staff Site Reliability / DevOps Engineer to lead the reliability, scalability, and security of our cloud-native platform. This role operates at the intersection of infrastructure engineering, platform strategy, and organizational leadership, supporting systems that power large-scale data processing and cutting-edge cancer detection technologies. You will define and drive infrastructure standards across teams, represent reliability and performance in architecture decisions, and build systems that scale well beyond your direct ownership. This is a highly technical, high-impact role combining hands-on engineering with cross-functional influence and mentorship.

Requirements

BS in Computer Science, Engineering, or related field, or equivalent experience
8+ years of experience in Site Reliability Engineering, DevOps, or platform engineering
Strong hands-on experience with at least one major cloud platform (AWS, GCP, or Azure)
Experience implementing infrastructure-as-code solutions (Terraform, CloudFormation, or similar)
Experience designing and operating CI/CD pipelines (e.g., GitLab CI, GitHub Actions, Jenkins)
Hands-on experience with Kubernetes and containerized systems in production environments
Proficiency in scripting or programming for automation (e.g., Python, Go, Bash, or PowerShell)
Experience with observability and monitoring tools (e.g., Prometheus, Grafana, OpenTelemetry, Datadog)
Strong understanding of networking, security, and distributed systems fundamentals
Experience working in regulated environments and familiarity with frameworks such as ISO 27001, NIST, SOC 2, or HIPAA

Nice To Haves

10+ years of experience in SRE, DevOps, or infrastructure engineering
Experience operating multi-cluster Kubernetes environments (e.g., EKS, GKE) at scale
Familiarity with GitOps practices (e.g., ArgoCD, Flux)
Experience with data platforms and pipelines (e.g., Kafka, Airflow, Spark, Snowflake, BigQuery)
Experience implementing SLO/SLI frameworks and reliability practices across multiple teams
Strong background in cloud security, including IAM, zero-trust architecture, and secrets management
Experience with compliance-as-code and security tooling (e.g., OPA, Snyk, Checkov)
Exposure to AI/ML or large-scale data infrastructure workloads
Experience in healthcare, biotech, or other regulated industries
Relevant cloud or Kubernetes certifications (e.g., AWS DevOps, CKA/CKS, GCP DevOps)

Responsibilities

Design, build, and operate highly available, fault-tolerant cloud infrastructure across AWS, GCP, and/or Azure
Architect and maintain scalable CI/CD pipelines and deployment frameworks for enterprise-grade software delivery
Lead infrastructure-as-code adoption and maturity using tools such as Terraform, CloudFormation, and Ansible
Own Kubernetes reliability across multi-cluster environments, including upgrades, scaling, and workload lifecycle management
Establish and evolve observability platforms (metrics, logs, traces) and define SLO/SLI frameworks across teams
Lead incident response for critical outages, drive root cause analysis, and implement preventative improvements
Optimize infrastructure for cost, performance, and scalability, partnering closely with engineering and finance stakeholders
Define and enforce DevOps, reliability, and security best practices across the organization
Partner cross-functionally with engineering, data, QA, security, and IT teams to design resilient systems
Mentor engineers and contribute to technical leadership through design reviews, standards, and knowledge sharing