Senior DevOps Engineer

Qualified Health PBC
Palo Alto, CA
$170,000 - $220,000
Hybrid

About The Position

We're looking for a Senior DevOps Engineer / Site Reliability Engineer to ensure the reliability, performance, and operational excellence of our production environments powering AI solutions for major health systems. You'll partner closely with engineering teams to make services production-ready, own observability and incident response, and drive the practices that keep our platform stable as we scale. As a key member of our infrastructure team, you'll be the connective tissue between development and production, ensuring new features ship safely while maintaining the reliability standards required for healthcare workloads.

Requirements

  • 6+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering, with at least 3 years directly managing production workloads
  • Strong proficiency with Terraform including module development, state management, and multi-environment architectures
  • Deep experience operating production Kubernetes environments, including troubleshooting, networking, workload management, and cluster operations
  • Hands-on experience with both Google Cloud Platform and Microsoft Azure services
  • Strong networking and security knowledge, including zero trust architectures, network segmentation, private connectivity, identity-based access controls, and secrets management
  • Production experience with Temporal or comparable workflow orchestration systems
  • Strong proficiency in Python for automation, tooling, and operational scripting
  • Demonstrated experience designing and operating observability stacks including metrics, logging, tracing, and alerting
  • Experience leading incident response, including on-call rotation management, runbook development, and postmortem processes
  • Track record of partnering with engineering teams to improve production readiness and release practices
  • Excellent written communication skills for authoring runbooks, postmortems, and release documentation
  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience

Nice To Haves

  • Experience in the healthcare industry, with an understanding of HIPAA compliance requirements
  • Familiarity with HITRUST or similar compliance frameworks
  • Experience operating LLM-based systems, agentic workflows, or RAG pipelines in production
  • Experience with GitOps workflows (Rancher Fleet, ArgoCD, or Flux)
  • Experience building and operating multi-tenant SaaS infrastructure
  • Familiarity with chaos engineering and reliability testing practices
  • Prior experience as a founding or early SRE/Platform hire at a startup

Responsibilities

  • Partner with engineering teams to ensure services are production-ready before release, including reviewing deployment patterns, failure modes, resource requirements, and rollback strategies
  • Design and maintain observability infrastructure including metrics, logging, distributed tracing, and dashboards across multi-cloud environments
  • Define and manage alerting policies, SLIs/SLOs, and on-call rotations to ensure timely response to production issues
  • Lead and support incident response for production issues, drive root cause analysis, and coordinate hotfix deployments when needed
  • Author and maintain release documentation, runbooks, incident postmortems, and operational playbooks
  • Provide day-to-day operational support to engineering teams, unblocking deployments, debugging production issues, and improving developer experience around shipping to production
  • Design and maintain zero trust network architectures, ensuring secure connectivity across multi-cloud environments and tenant boundaries
  • Build and improve CI/CD pipelines and release processes to make production deployments safer, faster, and more predictable
  • Develop automation in Python and Terraform to reduce toil and codify operational best practices
  • Manage Kubernetes-based workloads in production, including troubleshooting cluster issues, optimizing resource utilization, and maintaining workload reliability
  • Operate Temporal workflows in production, including monitoring, scaling, and troubleshooting long-running workflow executions
  • Collaborate with security and compliance teams to maintain HIPAA and HITRUST controls across production environments

Benefits

  • Competitive salaries with equity packages
  • Robust medical/dental/vision insurance
  • Flexible working hours
  • Hybrid work options
  • Inclusive environment that fosters creativity and innovation