Senior DevOps Engineer

Qualified Health PBC • Palo Alto, CA

$170,000 - $220,000 • Hybrid

About The Position

We're looking for a Senior DevOps Engineer / Site Reliability Engineer to ensure the reliability, performance, and operational excellence of our production environments powering AI solutions for major health systems. You'll partner closely with engineering teams to make services production-ready, own observability and incident response, and drive the practices that keep our platform stable as we scale. As a key member of our infrastructure team, you'll be the connective tissue between development and production, ensuring new features ship safely while maintaining the reliability standards required for healthcare workloads.

Requirements

6+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering, with at least 3 years directly managing production workloads
Strong proficiency with Terraform, including module development, state management, and multi-environment architectures
Deep experience operating production Kubernetes environments, including troubleshooting, networking, workload management, and cluster operations
Hands-on experience with both Google Cloud Platform and Microsoft Azure services
Strong networking and security knowledge, including zero trust architectures, network segmentation, private connectivity, identity-based access controls, and secrets management
Production experience with Temporal or comparable workflow orchestration systems
Strong proficiency in Python for automation, tooling, and operational scripting
Demonstrated experience designing and operating observability stacks, including metrics, logging, tracing, and alerting
Experience leading incident response, including on-call rotation management, runbook development, and postmortem processes
Track record of partnering with engineering teams to improve production readiness and release practices
Excellent written communication skills for authoring runbooks, postmortems, and release documentation
Bachelor's degree in Computer Science, Engineering, or related field, or equivalent experience

Nice To Haves

Experience in the healthcare industry, with an understanding of HIPAA compliance requirements
Familiarity with HITRUST or similar compliance frameworks
Experience operating LLM-based systems, agentic workflows, or RAG pipelines in production
Experience with GitOps workflows (Rancher Fleet, ArgoCD, or Flux)
Experience building and operating multi-tenant SaaS infrastructure
Familiarity with chaos engineering and reliability testing practices
Prior experience as a founding or early SRE/Platform hire at a startup

Responsibilities

Partner with engineering teams to ensure services are production-ready before release, including reviewing deployment patterns, failure modes, resource requirements, and rollback strategies
Design and maintain observability infrastructure, including metrics, logging, distributed tracing, and dashboards, across multi-cloud environments
Define and manage alerting policies, SLIs/SLOs, and on-call rotations to ensure timely response to production issues
Lead and support incident response for production issues, drive root cause analysis, and coordinate hotfix deployments when needed
Author and maintain release documentation, runbooks, incident postmortems, and operational playbooks
Provide day-to-day operational support to engineering teams, unblocking deployments, debugging production issues, and improving developer experience around shipping to production
Design and maintain zero trust network architectures, ensuring secure connectivity across multi-cloud environments and tenant boundaries
Build and improve CI/CD pipelines and release processes to make production deployments safer, faster, and more predictable
Develop automation in Python and Terraform to reduce toil and codify operational best practices
Manage Kubernetes-based workloads in production, including troubleshooting cluster issues, optimizing resource utilization, and maintaining workload reliability
Operate Temporal workflows in production, including monitoring, scaling, and troubleshooting long-running workflow executions
Collaborate with security and compliance teams to maintain HIPAA and HITRUST controls across production environments