Staff / Senior Software Engineer, Compute Capacity

Anthropic•San Francisco, CA

$320,000 - $405,000•Hybrid

About The Position

Anthropic manages one of the largest and fastest-growing accelerator fleets in the industry, spanning multiple accelerator families and clouds. The Accelerator Capacity Engineering (ACE) team is responsible for making sure every chip in that fleet is accounted for, well-utilized, and efficiently allocated. We own the data, tooling, and operational systems that let Anthropic plan, measure, and maximize utilization across first-party and third-party compute.

As an engineer on ACE, you will build the production systems that power this work: data pipelines that ingest and normalize telemetry from heterogeneous cloud environments, observability tooling that gives the org real-time visibility into fleet health, and performance instrumentation that measures how efficiently every major workload uses the hardware it’s running on. You will be expected to write production-quality code every day, operate alongside Kubernetes-native infrastructure at meaningful scale, and directly influence decisions around one of Anthropic’s largest areas of spend.

You’ll collaborate closely with research engineering, infrastructure, inference, and finance teams. The work requires someone who can move comfortably between data engineering, systems engineering, and observability, and who thrives in a high-autonomy, high-ambiguity environment.

Requirements

5+ years of software engineering experience with a strong track record of building and operating production systems. You write code every day: this is a hands-on engineering role, not a planning or coordination role.
Kubernetes fluency at operational depth — you’ve operated production K8s at meaningful scale, not just written manifests. Comfort with scheduling, taints, labels, node management, and debugging cluster-level issues.
Data pipeline engineering experience — designing, building, and owning the full lifecycle of production data pipelines. Experience with data warehouses (BigQuery preferred), schema management, streaming ingestion, SLOs for latency and completeness, and a strong instinct for correctness.
Observability tooling experience — Prometheus, PromQL, and Grafana are in the critical path for this team. Experience writing recording rules, understanding metric semantics, and building monitoring systems that engineering teams actually rely on.
Python and SQL at production quality. Most pipeline code is Python; the presentation layer is BigQuery SQL, including table-valued functions and views. Both need to be idiomatic, well-tested, and maintainable.
Familiarity with at least one major cloud provider (AWS, GCP, or Azure) at the infrastructure level — compute, billing, usage APIs, cost management tooling. Multi-cloud experience is a strong plus.
High autonomy and strong cross-team communication. You can gather your own requirements, navigate ambiguity, and work across organizational boundaries. Scrappiness and ownership matter more than polish.

Nice To Haves

Multi-cloud data ingestion experience — especially working with AWS and GCP APIs, billing exports, or vendor-specific telemetry formats. Experience normalizing data from external providers with different billing arrangements is directly applicable.
Accelerator infrastructure familiarity — GPU metrics (DCGM), TPU utilization, Trainium power and utilization metrics, or experience working with ML training/inference systems at the hardware level.
Performance engineering and benchmarking experience: building benchmark harnesses, establishing baselines, reasoning about compute efficiency (FLOPs utilization, memory bandwidth, interconnect throughput), and working with system-owning teams to diagnose and improve performance.
Data-as-product thinking — experience building internal data products with self-service access, schema contracts, API serving, documentation, and discoverability. Not just building pipelines, but thinking about how platform data gets consumed.
Experience with capacity planning, resource management, or cost attribution systems at a hyperscaler or in a large-scale ML environment, such as FinOps, chargeback systems, or infrastructure cost modeling.
Familiarity with ClickHouse, Terraform, or Rust. ClickHouse is the team’s current streaming store, Terraform is used for infrastructure-as-code, and Rust is used for high-performance data collection agents.

Responsibilities

Build and operate data pipelines that ingest accelerator occupancy, utilization, and cost data from multiple cloud providers into BigQuery. Own data completeness, latency SLOs, gap detection, and backfill automation.
Develop and maintain observability infrastructure — Prometheus recording rules, Grafana dashboards, and alerting systems — that surface actionable signals about fleet health, occupancy, and efficiency.
Instrument and analyze compute efficiency metrics across training, inference, and eval workloads. Build benchmarking infrastructure, establish per-config baselines, and work with system-owning teams to improve utilization.
Build internal tooling and platforms that enable capacity planning, workload attribution, and cluster debugging. The consumers are other engineering teams, finance, and leadership — not external users.
Operate Kubernetes-native systems at scale — deploying data collection agents, managing workload labeling infrastructure, and understanding how taints, reservations, and scheduling affect capacity.
Normalize and reconcile data across heterogeneous sources — including AWS, GCP, and Azure billing exports, vendor-specific telemetry formats, and internal systems with different schemas and billing arrangements.
Collaborate across organizational boundaries with research engineering, infrastructure, inference, and finance teams. Gather requirements from technical stakeholders, translate them into useful systems, and communicate trade-offs to non-technical audiences.