Staff Platform Engineer

Citizen Health • San Francisco, CA

2d • Remote

About The Position

We're hiring a Staff Platform Engineer to build and own the infrastructure that powers Citizen Health — the cloud platform, deployment systems, and reliability practices that let our teams ship AI-driven features safely at speed. You'll partner with teams across AI, web, mobile, and data engineering to keep our platform performant, secure, and compliant with healthcare standards (HIPAA, SOC 2, and adjacent frameworks). This is a hands-on technical leadership role. You'll write code, set the direction for platform engineering, and raise the bar for how the entire org runs in production. You will likely be the first dedicated infrastructure hire and will build the platform engineering function from the ground up — owning everything from Terraform to on-call until you grow the team.

Requirements

8+ years of software / infrastructure engineering experience; 4+ years in a senior IC or technical leadership role on a Platform, SRE, or DevOps team.
Deep expertise running production systems on a major cloud provider (AWS, GCP, or Azure).
Hands-on with Kubernetes, infrastructure-as-code (Terraform, Pulumi, or similar), and modern CI/CD tooling (GitHub Actions, ArgoCD, Spinnaker, etc.).
Strong programming skills in at least one of Go, Python, or TypeScript — you build tooling, not just configure it.
Demonstrated experience scaling infrastructure through a significant growth inflection (10x users, new product surface, architectural migration) in a production environment.
Proven track record designing for high availability, zero-downtime releases, and graceful degradation in regulated or data-intensive environments.
Strong security fundamentals: IAM, secrets management, network segmentation, vulnerability management, supply-chain security.
Experience at an early-stage or high-growth startup where you owned infrastructure broadly, not just a narrow slice.
Excellent written and verbal communication — you're equally effective in a post-mortem, a roadmap discussion, and a Slack thread.

Nice To Haves

Experience supporting AI/ML workloads in production — model serving, inference optimization, GPU infrastructure, vector databases, or large-scale data pipelines. This is close to a must-have given our architecture.
Familiarity with agentic AI systems, multi-agent orchestration, or long-running autonomous workflows.
Experience with messaging infrastructure at scale — message queues, webhook reliability, delivery guarantees, rate limiting (WhatsApp, SMS, or similar).
Background in healthtech, life sciences, fintech, or other regulated domains (HIPAA, SOC 2, HITRUST, FedRAMP).
Familiarity with FHIR / HL7 or other healthcare data standards.
Experience growing a Platform or SRE function from a small group of generalists into a specialized team.

Responsibilities

Platform & infrastructure strategy — Set the technical direction for our agentic platform. Design architectures that handle long-running AI agent sessions, real-time WhatsApp message delivery, and autonomous tool execution (browser, phone, API calls) — all within HIPAA boundaries. Make the right build-vs-buy calls and keep infrastructure cost-efficient as we scale from hundreds to tens of thousands of active patients.
Reliability & observability — Define SLOs, build the observability stack, drive incident response, and keep uptime boringly high. Make on-call sustainable. When something breaks at 2am, you fix it, write the post-mortem, and spend the next sprint making sure it never happens again — without being asked.
CI/CD & developer experience — Own the deployment pipeline end-to-end. Eliminate friction in the inner and outer loops so teams can ship to production dozens of times a day with confidence. Our AI engineers should be thinking about models and prompts, not fighting deploys.
Inference cost & capacity management — AI inference is our largest variable cost. Partner with AI engineering to optimize model serving, manage GPU and compute capacity, negotiate vendor contracts, and make sure our unit economics work as we scale.
Security & compliance — Partner with security and legal to maintain HIPAA, SOC 2, and related regulatory standards. Implement controls that protect patient data without slowing teams down. Secure the agent tool-use pipeline — when our AI agent opens a browser or makes a phone call on behalf of a patient, the blast radius has to be contained.
Cross-functional partnership — Work closely with Product, AI/ML, Data, and Security to translate platform needs into the roadmap. Unblock teams and mentor engineers across the org.