Network Engineer, Capacity and Efficiency

Anthropic•San Francisco, CA

53d•Hybrid

About The Position

Anthropic's mission is to create reliable, interpretable, and steerable AI systems. The Capacity & Efficiency team, within Anthropic's Compute organization, is responsible for the cost, utilization, and attribution of non-accelerator infrastructure, including the network, compute, and storage backbone that handles petabytes of data across clouds and regions.

This is a hands-on individual contributor role focused on using deep networking knowledge and rigorous measurement to optimize bandwidth, latency, and cost. The engineer will be responsible for the observability and efficiency of Anthropic's network, from per-flow telemetry to cost attribution for research teams. The work involves writing code (Python, Go), building dashboards, modeling capacity, and influencing architectural decisions. Key focus areas are network telemetry, observability, and cost modeling/attribution; candidates are expected to be strong in at least two of these areas and willing to grow in the third.

Requirements

5+ years operating large-scale production networks (data center fabrics, backbone/WAN, or hyperscaler-adjacent environments).
Fluency across BGP (policy and communities), ECMP, VXLAN/EVPN or equivalent overlays, QoS (DSCP, queuing, shaping), and L1/optical basics (DWDM, coherent optics, LAGs).
Deep knowledge of at least one major CSP’s networking model (AWS or GCP) and understanding of overlay/underlay interactions.
Experience building or operating network telemetry at scale (streaming telemetry, flow export, or eBPF-based instrumentation), with an understanding of sampling, cardinality, and storage tradeoffs.
Proficiency in writing Python or Go for tooling, telemetry pipelines, infrastructure-as-code, and network automation.
Quantitative thinking: ability to use data (notebooks, Grafana queries) to drive decisions and build cost models from counter data.
Clear communication skills: ability to explain technical and financial impacts to various stakeholders.

Nice To Haves

SRE experience with large-scale network infrastructure, including designing for reliability, defining SLOs/SLIs, capacity planning with error budgets, and incident response.
Experience on a cloud provider's networking team or cloud networking product team.
Familiarity with AI/ML infrastructure traffic patterns (collective communication, checkpoint/weight transfer, inference serving) and their network impact.
Experience with HPC fabrics (InfiniBand, RoCE v2, lossless Ethernet) and their operational aspects.
Background in traffic engineering for large backbones.
Hands-on experience with multi-cloud connectivity and associated billing models.
Experience building cost/chargeback systems for shared infrastructure or FinOps experience in a large cloud environment.

Responsibilities

Build the network observability stack, including designing and deploying telemetry pipelines (sFlow/IPFIX, gNMI streaming, eBPF host probes) to gather per-flow, per-tenant, and per-workload cost and utilization data.
Own the SLIs for backbone and DCN fabric health.
Analyze inter-region traffic patterns, identify hot links and stranded capacity, and quantify the dollar impact to find efficiency opportunities.
Build models that inform whether to acquire more capacity or migrate workloads.
Design and operate QoS and traffic engineering across the backbone, ensuring efficient transfer of data without impacting latency-sensitive inference.
Drive cost attribution by tying network spend (egress, interconnect ports, transit, optical leases) to the teams and workloads generating it.
Influence decisions made by other teams by presenting data-driven insights on traffic patterns, capacity needs, and QoS policies.
Partner closely with Systems Networking on fabric architecture and Observability on telemetry platform integration.
Automate network configuration and tooling to implement efficiency findings safely and effectively.