Platform Engineer, AI Platform

Optura•San Francisco, CA

46d•Remote

About The Position

Optura is healthcare’s AI orchestration platform. We help healthcare organizations transform disconnected AI pilots into a unified, enterprise-scale program that delivers measurable value. Our platform enables teams to design, execute, and monitor intelligent agents that drive automation, insights, and action, while providing the control and observability needed to scale safely. Built for real-world complexity, Optura supports multiple model providers, integrates with existing infrastructure, and offers both SaaS and self-hosted options. Our mission: revolutionize how healthcare deploys and operationalizes AI in production.

We’re looking for a Senior Platform Engineer to design, build, and operate the core services that power Optura’s AI Platform. In this role, you will own systems end-to-end, from model and agent orchestration to routing, reliability, and observability. You will partner closely with product and application teams to deliver secure, scalable, HIPAA-aware services, and you will play a critical role in shaping the foundation that enables customers to safely deploy AI in real-world healthcare environments.

Requirements

5+ years of software engineering experience with strong proficiency in Python and TypeScript
2+ years of experience operating AI systems in production (agentic workflows, RAG, orchestration, or similar)
Experience operating in cloud environments, using containers/Kubernetes (EKS or ECS) and Terraform
Experience designing and operating distributed systems with a focus on performance optimization and deep debugging
Experience with observability systems (metrics, tracing, logging) and on-call ownership

Nice To Haves

Experience working in healthcare or other regulated industries, including HIPAA or PHI-handling practices
Experience with LLMOps, including prompt management, evaluation frameworks, guardrails, and cost and latency tuning
Experience building or operating model gateways, traffic shaping, multi-provider routing, and caching at scale

Responsibilities

Build core platform services in Python and TypeScript for orchestration, routing, model gateways, retrieval-augmented generation (RAG), and evaluation pipelines
Leverage AI-assisted development tools (e.g., Claude, Cursor) alongside tests, linters, and benchmarks to improve velocity and quality
Own services from design through deployment, including SLO creation, dashboards, runbooks, and operational readiness
Improve reliability by optimizing system latency, availability, performance, and cost; lead or participate in incident response and postmortems
Develop production AI capabilities including guardrails, prompt and version management, offline and online evaluations, and multi-provider integrations
Build and maintain data and storage systems including vector search (pgvector, Pinecone, OpenSearch), caching, and Postgres/RDS patterns
Implement security and compliance best practices aligned to HIPAA, including RBAC, audit logging, least-privilege access, and secrets management