Sr. Site Reliability Engineer

Optura
San Francisco, CA
Remote

About The Position

Optura is healthcare’s AI orchestration platform. We help healthcare organizations transform disconnected AI pilots into a unified, enterprise-scale program that delivers measurable value. Our platform enables teams to design, execute, and monitor intelligent agents that drive automation, insights, and action, while providing the control and observability needed to scale safely. Built for real-world complexity, Optura supports multiple model providers, integrates seamlessly with existing infrastructure, and offers both SaaS and self-hosted options. Our mission: revolutionize how healthcare deploys and operationalizes AI in production.

We’re looking for a Senior Platform Engineer to design, build, and operate the core services that power Optura’s AI Platform. In this role, you will own systems end-to-end, from model and agent orchestration to routing, reliability, and observability. You will partner closely with product and application teams to deliver secure, scalable, HIPAA-aware services, and you will play a critical role in shaping the foundation that enables customers to safely deploy AI in real-world healthcare environments.

Requirements

  • 8+ years operating production infrastructure, including 3+ years in a senior SRE, platform, or staff infrastructure role
  • Deep Kubernetes expertise across managed (EKS, GKE, AKS) and self-managed/on-prem distributions — not just running it, but operating it at scale across heterogeneous environments
  • Multi-cloud fluency across AWS, GCP, and Azure, with informed opinions on when to abstract vs. embrace cloud-native primitives
  • Expertise with Terraform (or Pulumi/Crossplane) and GitOps tooling
  • Experience shipping infrastructure that runs in customer environments — packaging, install/upgrade UX, air-gapped artifacts, support escalation paths
  • Strong networking, identity, and security fundamentals: VPC design, service mesh, mTLS, OIDC, KMS, secrets management
  • Production observability ownership (Prometheus, Grafana, OpenTelemetry, distributed tracing) and on-call leadership
  • A track record of writing real code — Go, Python, or similar — to extend the platform, not just configure it

Nice To Haves

  • Experience shipping HIPAA-regulated workloads, including BYOC or air-gapped customer deployments
  • Background with enterprise software delivery tooling (Replicated, Cluster API, Talos, Rancher, OpenShift)
  • Experience building internal developer platforms (Backstage, golden paths) that measurably reduced lead time for an engineering org
  • FinOps experience — driving meaningful cloud spend reductions through architecture, not just rightsizing
  • AI/ML infrastructure exposure: GPU scheduling, model-serving stacks, inference autoscaling
  • OSS contributions to infrastructure projects, or strong opinions formed running them at scale

Responsibilities

  • Architect and own Optura's multi-cloud infrastructure across AWS, GCP, and Azure — provisioning, networking, identity, observability, and cost governance
  • Design and operate Kubernetes platforms that run consistently across our cloud environments and inside customer environments, including BYOC and on-prem (potentially air-gapped) deployments
  • Build a unified deployment framework so Optura ships the same product to SaaS, BYOC, and on-prem customers without bespoke per-customer engineering — Helm charts, operators, install/upgrade tooling, and release pipelines
  • Own SLOs, capacity planning, incident response, and postmortems across the entire infrastructure stack; set the bar for operational readiness
  • Drive reliability and performance through error budgets, chaos testing, latency optimization, and disciplined runbook quality
  • Harden the platform for regulated deployments — HIPAA controls, tenant isolation, audit logging, RBAC, KMS, and secrets rotation
  • Lead the build-out of IaC, GitOps, and progressive delivery (Terraform, Argo CD, Crossplane) as the team's standard
  • Partner with engineering and security to set opinionated guardrails: golden paths, base images, policy-as-code, and CI/CD that the rest of the org adopts by default

Benefits

  • Health, dental, and vision insurance
  • Generous paid time off
  • Opportunities for professional growth and development