About The Position

You'll join the team that builds and enables agentic workflows across Brain Co., for every engineer, operator, and business team internally, and for the production AI systems we deploy to governments, healthcare systems, and critical industries. This is a platform role at the center of the company's agent-first strategy: you'll build foundational systems used by every engineering team, and the bar is product-grade because the entire company depends on them.

Requirements

  • Have 5+ years building backend systems in production, with deep proficiency in at least one of Python, TypeScript, Go, or Rust.
  • Bring strong fundamentals in distributed systems: consistency, idempotency, retries, failure modes, queueing, scheduling.
  • Have designed and operated APIs and services that other engineers depend on.
  • Have a proven track record building shared infrastructure, internal platforms, or developer-facing services that real users adopted.
  • Have strong intuition for developer experience, long-term maintainability, and where to draw abstraction boundaries.
  • Are comfortable owning the full lifecycle: writing the design doc, shipping the MVP, hardening it, and driving adoption across the company.
  • Have owned services with real uptime and operational responsibility, and are comfortable with observability stacks, incident response, and SLOs.
  • Bring cloud-native experience: Kubernetes, infrastructure-as-code, OAuth/OIDC, secrets management.

Nice To Haves

  • Experience building or operating LLM infrastructure: gateways, inference systems, prompt routing, cost attribution, evaluation harnesses.
  • Experience with agent frameworks, tool-use systems, or sandboxed code execution.
  • Security instincts around prompt injection, supply-chain risk in agent ecosystems, and credential scoping for autonomous systems.
  • Background in multi-tenant, regulated, or government deployments (HIPAA, SOC 2).
  • Open-source contributions to AI infrastructure, agent tooling, or developer platforms.

Responsibilities

  • Own the foundations of how LLMs are used across the company: cost visibility and controls, data privacy, identity and access, routing, and the security posture around all provider traffic.
  • Design the sandboxing, orchestration, audit, and guardrail layers that product teams build their agents on, so verticals don't need to invent their own abstraction.
  • Solve the hard problems: prompt-injection defenses, scoped credentials, kill switches, multi-tenant isolation (including VM-level pod isolation), and runaway-cost controls.
  • Design the orchestration, isolation, and resource models that make agent execution viable at scale: cold-start vs. always-on tradeoffs, credential and token lifecycle, fan-out and fan-in patterns, fairness and quota enforcement across tenants, and the observability needed to debug at that scale.
  • Make AI-assisted development a first-class platform layer: coding agents that review and ship code, automate CI, refactor at scale, and run as background workers across the codebase, together with the canonical scaffolding and guardrails that govern them.
  • Build the systems that let every team (engineering, operations, and the business) run their own agents reliably and safely against the tools they already use, with the right credentials, scheduling, memory, and audit underneath.
  • End-to-end ownership: architecture, implementation, rollout, observability, on-call, and iteration based on internal user feedback.
  • Partner closely with security, infrastructure, and product teams to make agent deployments safe by default.

Benefits

  • Competitive salary plus equity
  • Daily lunches
  • Commuter benefits
  • 401(k)
  • Medical, dental, and vision
  • Unlimited PTO
© 2026 Teal Labs, Inc