Senior Site Reliability Engineer

CertifyOS

About The Position

We’re looking for a Senior Site Reliability Engineer who takes ownership seriously: someone who designs for reliability, ships the automation, and stands behind it in production. You’ll work across cloud-native infrastructure on systems that process millions of provider records. This is a role with real scope: you’ll own the operational lifecycle end-to-end and influence platform architecture, reliability standards, and deployment workflows across systems that matter.

SREs at CertifyOS own the full lifecycle of what they support, from infrastructure design and deployment automation through observability, incident response, and postmortems. We use AI-assisted tooling aggressively to reduce toil and accelerate troubleshooting; that raises the floor on the problems we tackle and is not an excuse to reduce rigor. If you do your best work reacting to incidents, this probably isn’t the right fit. If you do your best work preventing them, we should talk.

Healthcare provider data infrastructure is a distributed systems problem at scale. Hundreds of upstream integrations, inconsistent data sources, and evolving workloads all introduce operational complexity and reliability risk.

Problems You’ll Solve:

Reliability and observability at scale. You’re operating a platform that hundreds of integrations depend on. How do you maintain uptime, reduce alert fatigue, and build actionable observability across GKE and Cloud Run without drowning in noise? That means meaningful SLIs, error budgets, and data quality signals, not just p99 latency.
Scaling infrastructure efficiently. As platform usage grows, infrastructure costs and operational complexity grow with it. You’ll improve autoscaling behavior, resource utilization, and workload efficiency across cloud-native distributed systems.
Incident response and operational maturity. Production incidents are inevitable; operational chaos is optional. You’ll own incident response processes, root cause analysis, escalation workflows, and runbooks, and make sure hard problems don’t happen again.
Infrastructure automation and developer velocity. You’ll build and maintain Infrastructure as Code, CI/CD pipelines, and operational tooling that reduce manual work and improve engineering productivity without sacrificing reliability.
Reliability engineering for data platforms. Uptime isn’t enough: you need to know when a provider record is stale, a pipeline is lagging, or a workload is behaving unexpectedly. You’ll instrument data freshness and infrastructure health, not just service uptime.

Requirements

5+ years in SRE, DevOps, Platform Engineering, or Infrastructure Engineering — operating production systems at scale where your infrastructure is someone else’s dependency and failures have real downstream consequences
Track record of improving reliability end-to-end: you’ve debugged hard production problems, made them not happen again, and built the alerting to prove it
Strong Linux systems administration, incident response, and root cause analysis skills
Comfort influencing operational standards and mentoring teams on reliability practices
Deep hands-on experience with GCP — GKE, Cloud Run, and containerized workloads at scale
Experience building and maintaining Infrastructure as Code with Terraform and/or Pulumi
Fluency across deployment patterns and the judgment to know when each fits: rolling deployments, blue/green, canary — and the rollback story for each
Experience with autoscaling, resource optimization, and infrastructure efficiency for distributed systems
Experience managing infrastructure security, secrets, and access controls in regulated or security-conscious environments
Strong understanding of Golden Signals monitoring — latency, traffic, errors, saturation — and how to make them actionable rather than noisy
Experience designing SLIs, SLOs, error budgets, alerting strategies, dashboards, and escalation workflows
Hands-on experience with observability platforms: Google Cloud Monitoring, Datadog, Grafana, Prometheus, or similar
Strong sense of data platform health: lineage, freshness, and correctness matter as much to you as throughput
Experience building and maintaining CI/CD pipelines using GitHub Actions or similar
Scripting or programming fluency in Python, Bash, Go, or similar — you reduce toil through code, not process
Experience working with Git workflows and modern software delivery practices
Strong written and verbal communication — you can explain an operational risk to an engineer and a product manager in the same conversation
Experience operating systems handling sensitive data or PII in regulated or compliance-adjacent environments

Nice To Haves

Experience operating large-scale distributed systems or microservices architectures
Familiarity with healthcare, credentialing, or health-tech environments
Experience leveraging AI-assisted observability or incident response tooling
Familiarity with NodeJS, TypeScript, Java, or React application stacks

Responsibilities

Own the operational lifecycle end-to-end
Influence platform architecture, reliability standards, and deployment workflows
Design for reliability
Ship the automation
Stand behind that automation in production
Own the full lifecycle of what you support, from infrastructure design and deployment automation through observability, incident response, and postmortems
Maintain uptime, reduce alert fatigue, and build actionable observability
Improve autoscaling behavior, resource utilization, and workload efficiency
Own incident response processes, root cause analysis, escalation workflows, and runbooks
Build and maintain Infrastructure as Code, CI/CD pipelines, and operational tooling
Instrument data freshness and infrastructure health

Benefits

100% coverage of health, dental, and vision insurance premiums for employees
Unlimited PTO, with at least two weeks off each year to recharge (for US-based team)
Health insurance, statutory leave benefits, and additional wellness (menstrual) leave for women (for India employees)
