Engineering Manager, Accelerator Platform

Anthropic
Seattle, WA
Hybrid

About The Position

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.

About the Role

Every time someone talks to Claude -- through the API, claude.ai, our cloud partners, or any of our expanding surfaces -- the request lands on an AI accelerator. Not one kind, many kinds: TPUs, Trainium chips, GPUs. Each arrives with its own software stack, performance characteristics, failure modes, and operational quirks. Someone has to take raw silicon and turn it into a platform that the rest of Anthropic can build on without thinking about which chip is underneath. That's us.

The Accelerator Platform team owns the bring-up and normalization of new hardware platforms for Anthropic's first-party inference fleet. We sit between the low-level systems teams and the serving infrastructure that runs production inference -- bridging the gap so that every new accelerator generation ships as a first-class production platform. It's deeply technical work at the intersection of hardware enablement, distributed systems, and ML infrastructure, and it is directly on the critical path for Anthropic's compute strategy.

We're hiring an Engineering Manager to build and lead this team. You'll inherit a small nucleus of experienced engineers and grow it into a standalone platform organization. You'll set technical direction, hire a strong team, and partner closely with hardware vendors, cloud providers, and teams across Inference to bring new accelerator generations online quickly and reliably.

Requirements

  • Have significant experience managing infrastructure or platform engineering teams (3+ years in engineering management)
  • Have deep technical fluency in systems programming, distributed systems, or hardware/software co-design -- you need to understand the stack deeply enough to make sound technical and hiring decisions
  • Have experience bringing up or operating heterogeneous compute infrastructure at scale -- whether that's GPU clusters, TPU pods, custom ASICs, or FPGA deployments
  • Are comfortable with ambiguity and can build structure where none exists. This team is being carved out as a new entity; you'll be defining its charter, processes, and culture from scratch
  • Think strategically about hardware roadmaps and can translate vendor capabilities into engineering plans
  • Build strong cross-functional relationships -- this role requires tight collaboration with hardware vendors, cloud partners, and half a dozen internal teams
  • Care deeply about both technical excellence and the people doing the work

Nice To Haves

  • Have direct experience with ML accelerator architectures (GPU/CUDA, TPU/XLA, Trainium/Neuron, or similar)
  • Have worked on ML inference serving infrastructure at scale (1000+ accelerators)
  • Have experience with Kubernetes-based ML workload orchestration
  • Understand ML-specific networking (RDMA, InfiniBand, NVLink, ICI) and how interconnect topology affects serving performance
  • Have experience managing vendor relationships and influencing hardware/software roadmaps
  • Have led teams through rapid growth phases (hiring 5+ engineers in a short timeframe)

Responsibilities

  • Build and lead the Accelerator Platform team -- hiring, developing, and retaining engineers who thrive at the hardware/software boundary
  • Own the end-to-end bring-up lifecycle for new accelerator platforms (multiple generations of Trainium, TPUs, and GPUs), from initial silicon availability through production-ready inference
  • Define and drive the platform normalization layer -- ensuring new hardware integrates cleanly with Anthropic's inference serving stack to provide a consistent abstraction
  • Partner with cloud providers (AWS, GCP, Microsoft Azure) and chip vendors on hardware roadmaps, capacity planning, and platform-specific technical challenges
  • Collaborate closely with teams across Inference and Infrastructure to ensure new platforms meet production reliability and latency requirements from day one
  • Contribute to Anthropic's multi-cloud compute strategy -- helping the organization maintain optionality across accelerator families and avoid lock-in to any single vendor
  • Manage the team's priorities across competing demands: new platform bring-up, ongoing production support for existing platforms, and longer-term investments in tooling and automation

Benefits

  • competitive compensation and benefits
  • optional equity donation matching
  • generous vacation and parental leave
  • flexible working hours
  • a lovely office space in which to collaborate with colleagues