Senior Platform AI Engineer

Drata • San Francisco, CA

3d • $192,000 - $259,800 • Hybrid

About The Position

Drata's AI Platform team builds the production infrastructure that powers AI features across our compliance platform, from MCP servers that make Drata's data available to AI agents to LLM workflow orchestration that automates SOC 2, TPRM, and policy analysis. You'll own the systems that sit between our AI models and our customers: tool definitions that agents actually understand, deployment pipelines that handle model upgrades without breaking output quality, and orchestration layers that manage multi-step agent workflows with persistent state.

This is not a traditional infrastructure role. You'll debug prompt templates alongside Terraform modules. You'll design API schemas optimized for LLM token budgets, not just HTTP throughput. When a model upgrade changes behavior across 15 workflows, you'll assess the quality impact, not just confirm that the containers are healthy. You'll work closely with our agent developers, product engineers, and an embedded SRE partner, sitting at the intersection of AI development and production reliability.

Our north star is simple: minimize the time it takes to launch a new agent in production. You're someone who asks "are we solving the right problem?" before writing the first line of code, who builds systems that make five other engineers faster, not just yourself, and who's equally proud of what you chose not to build.

Requirements

7+ years of software engineering experience, with 2+ years building or operating AI/ML infrastructure in production.
Strong in Python (our AI services are built in Python).
Experience with LLM APIs, vector databases, or AI orchestration platforms, and an understanding of the difference between "the service is up" and "the model output is good."
Comfortable across the stack: writing Terraform one day, debugging a prompt template the next, and designing an agent orchestration framework the day after.
Experience in several of these areas: cloud infrastructure (AWS preferred — ECS, S3, Bedrock), container orchestration, infrastructure-as-code, CI/CD pipeline design, API design, workflow orchestration engines, and distributed systems.
Hands-on experience with at least some AI-specific tooling: LLM APIs (Claude, OpenAI, etc.), model serving frameworks (vLLM, SageMaker, etc.), vector databases, embedding pipelines, prompt management platforms, or agent frameworks.
Clear communication about technical tradeoffs, especially when explaining AI-specific infrastructure decisions to stakeholders who think in terms of traditional reliability engineering.
Ownership of what you see broken, not just what's assigned to you: you can spot when an architecture decision will fail at scale and say so early, clearly, and with an alternative.

Nice To Haves

TypeScript/Node.js

Responsibilities

Design and build MCP (Model Context Protocol) servers that expose Drata's platform to AI agents. This means making architectural decisions about tool granularity, naming conventions for agent disambiguation, response compression for LLM context windows, and workspace isolation for multi-tenant access. You'll own the protocol layer that determines whether agents can reliably find and use the right tools — writing semantic parameter descriptions, contextual hints, and tool schemas that optimize for model comprehension, not just developer ergonomics.
Build and operate the infrastructure for deploying multi-step agent workflows — state management across complex reasoning chains, tool routing and execution runtimes, and long-running agentic processes that persist over time. Own the orchestration layer that coordinates agent planning, tool calls, and human-in-the-loop patterns. Design systems that handle agent failure modes gracefully: retries on ambiguous tool outputs, fallback strategies when models produce unexpected results, and observability into multi-step execution traces.
Own the operational side of our LLM workflows: model upgrades across production pipelines (assessing behavior changes, not just version bumps), prompt versioning and A/B testing, AI workflow deployment with custom container compatibility, and output quality monitoring.
Manage token capacity planning — understanding model costs, context limits, batching strategies, and rate governance across workflows. When an AI workflow fails, you'll investigate whether it's a prompt template issue, a model behavior change, or an infrastructure problem. Making that distinction requires understanding both systems.
Operate and evolve our production AI stack: vector storage and indexing (designing chunking strategies and metadata schemas for retrieval quality), document parsing pipelines, multi-region deployment, and cost optimization across LLM providers. You'll make RAG architecture decisions — embedding strategies, retrieval filtering, data model coordination — where the engineering challenge is search quality, not just system uptime. Implement caching layers and token-aware request routing to manage spend as AI workloads scale.
Build CI/CD patterns specific to AI workflows (reproducible deployments, SDK version compatibility, workflow rollback semantics). Own AI-specific observability — token usage dashboards, response quality metrics, agent execution traces, and cost-per-workflow tracking alongside traditional infrastructure monitoring. Enable product engineering teams to ship AI features faster by providing reliable, well-documented platform primitives.

Benefits

Stock equity
Up to 100% employer-paid premiums for medical, dental, and vision coverage for employees and their dependents
Comprehensive wellness benefits and healthcare concierge services
401(k) plan
Company-paid life and disability insurance
Tax-advantaged spending accounts
Discounted voluntary offerings
Paid Parental Leave policy (after six months of employment)
Kindbody fertility and family-building benefits
Dedicated leave specialists
Generous annual stipends for both professional and personal development
Access to a wide range of internal learning opportunities
Flexible vacation policy
Paid holidays