Staff+ Software Engineer - Infrastructure

Anthropic
San Francisco, CA
55d
Hybrid

About The Position

Anthropic is seeking talented and experienced Infrastructure Engineers to join our team and support the development, scaling, and maintenance of our cutting-edge AI systems. By joining our Infrastructure team, you will have the opportunity to work on groundbreaking AI technologies and contribute to the development of frontier models, supporting Anthropic's mission to create safe and reliable AI systems that benefit humanity.

We have multiple teams that are currently hiring. Team placement occurs after the interview process, taking into account your interests and experience alongside organizational needs. This flexible approach allows us to match talented engineers with the infrastructure teams where they'll have the greatest impact and growth potential:

  • Data Infrastructure: We build and maintain the data systems powering Anthropic's AI research and products. You'll design and optimize data pipelines using tools like Spark, Airflow, and dbt across GCP and AWS. Your work will ensure reliable, scalable data infrastructure while implementing governance best practices and driving continuous improvement.
  • Core Infrastructure: The systems team is responsible for supporting some of the largest, most sophisticated clusters in industry used to train, research, and ultimately serve AI models. Your work will be crucial in ensuring Anthropic is able to continue reliably and safely training frontier models. You will be responsible for building systems and running large Kubernetes clusters with GPU/TPU/Trainium workloads.
  • Observability: We build and maintain the infrastructure that monitors the health, performance, and efficiency of our AI systems. You'll work across teams to implement monitoring solutions using tools like Prometheus, Splunk, and Grafana, while developing automated approaches for dashboards and alerts. Your work will create reliable, low-maintenance systems that enable proactive monitoring and operational excellence.
  • Developer Productivity: The Developer Productivity team enables Anthropic researchers and engineers to be maximally effective in securely developing state-of-the-art models and the products that expose those models to users. All of the code written at Anthropic goes through systems and infrastructure built and maintained by our team. We aim to make development at Anthropic secure, efficient, and delightful.
  • Developer Acceleration: The Developer Acceleration team puts Anthropic at the forefront of engineering productivity by deeply integrating Claude at every step and ensuring engineers get well-configured, optimized dev environments. We own the development setup for engineers and Claude alike, focusing on deeply integrating Claude everywhere so Claude can do hours of work independently and engineers have a great experience.
  • Databases: The Databases team is responsible for building and scaling a reliable OLTP database platform for both Product and Research. We are responsible for the online SQL, NoSQL, vector, and KV stores used across the organization.
  • Privacy Infrastructure: The Privacy Engineering team focuses on building policy enforcement and data management solutions to ensure that all data flows across Anthropic follow the legal and privacy requirements for user data. This team is the DRI for all privacy-related requirements across all pillars of Anthropic: product, infra, research, and security.
  • AI Reliability: The AI Reliability Engineering team at Anthropic pioneers the future of systems reliability in the AI era, developing and achieving reliability targets across all our products, from public-facing web, API, and mobile services to backend training infrastructure. We execute engineering projects ensuring service reliability while collaborating cross-functionally with product and research teams to enhance the availability, manageability, and functionality of Anthropic's systems. As the team ultimately responsible for all service reliability, we take operational ownership of our largest products while innovating at the intersection of advanced model capabilities and time-tested engineering practices.

Requirements

  • Have 10+ years of relevant industry experience and 3+ years leading large-scale, complex projects or teams as an engineer or tech lead
  • Are obsessed with distributed systems at scale, infrastructure reliability, scalability, security, and continuous improvement
  • Have strong proficiency in at least one programming language (e.g., Python, Rust, Go, Java)
  • Have strong problem-solving skills and the ability to work independently
  • Have a passion for supporting internal partners like research to understand their needs
  • Have excellent communication skills to build consensus with stakeholders, both internally and externally
  • Possess deep knowledge of modern cloud infrastructure including Kubernetes, Infrastructure as Code, AWS, and GCP

Nice To Haves

  • Security and privacy best practice expertise
  • Experience with machine learning infrastructure like GPUs, TPUs, or Trainium, as well as supporting networking infrastructure like NCCL
  • Low-level systems experience, for example Linux kernel tuning and eBPF
  • Technical expertise: Quickly understanding systems design tradeoffs, keeping track of rapidly evolving software systems

Responsibilities

  • Lead the build-out of industry-leading AI clusters (thousands to hundreds of thousands of machines), partnering closely with cloud service providers on cluster build-out and required features
  • Consult with different stakeholders to deeply understand infrastructure, data and compute needs, identifying potential solutions to support frontier research and product development
  • Set technical strategy and oversee development of high-scale, reliable infrastructure systems
  • Mentor top technical talent
  • Design processes (e.g. postmortem review, incident response, on-call rotations) that help the team operate effectively and never fail the same way twice

Benefits

  • competitive compensation and benefits
  • optional equity donation matching
  • generous vacation and parental leave
  • flexible working hours
  • a lovely office space in which to collaborate with colleagues

What This Job Offers

Job Type: Full-time
Career Level: Mid Level
Industry: Publishing Industries
Number of Employees: 1,001-5,000 employees

© 2024 Teal Labs, Inc