Staff+ Software Engineer, Infrastructure

Anthropic
San Francisco, NY
Hybrid

About The Position

Anthropic is seeking experienced Infrastructure Engineers to support the development, scaling, and maintenance of our AI systems. On the Infrastructure team, you will contribute to the development of frontier models in support of Anthropic's mission to create safe and reliable AI systems that benefit humanity.

Anthropic's Infrastructure organization is the engine that powers our mission to develop AI systems that are safe, beneficial, and understandable. Every breakthrough in AI safety research and every interaction users have with Claude depends on the systems we build and operate: massive clusters for training, production infrastructure serving millions of users reliably, and developer platforms that help engineers move fast without breaking things.

This isn't typical infrastructure work. We're building at the frontier of what's possible, solving novel scaling challenges that few organizations face while maintaining a high degree of security, all in service of ensuring transformative AI benefits humanity. If you're energized by technical work that directly enables some of the most important research happening today, Infrastructure at Anthropic is the place to make a real difference.

We have multiple teams that are currently hiring. Team placement occurs after the interview process, taking into account your interests and experience alongside organizational needs. This flexible approach allows us to match engineers with the infrastructure teams where they'll have the greatest impact and growth potential.

Requirements

  • Have 10+ years of relevant industry experience, including 3+ years leading large-scale, complex projects or teams as an engineer or tech lead
  • Are obsessed with distributed systems at scale, infrastructure reliability, scalability, security, and continuous improvement
  • Have strong proficiency in at least one programming language (e.g., Python, Rust, Go, Java)
  • Have strong problem-solving skills and the ability to work independently
  • Have a passion for understanding and supporting the needs of internal partners such as research
  • Have excellent communication skills to build consensus with stakeholders, both internally and externally
  • Possess deep knowledge of modern cloud infrastructure, including Kubernetes, Infrastructure as Code, AWS, and GCP
  • Have at least a Bachelor's degree in a related field or equivalent experience

Nice To Haves

  • Security and privacy best practice expertise
  • Experience with machine learning infrastructure like GPUs, TPUs, or Trainium, as well as supporting networking infrastructure like NCCL
  • Low-level systems experience, for example Linux kernel tuning and eBPF
  • Technical expertise: the ability to quickly understand systems design tradeoffs and keep track of rapidly evolving software systems

Responsibilities

  • Lead the build-out of industry-leading AI clusters (thousands to hundreds of thousands of machines), partnering closely with cloud service providers on cluster build-out and required features
  • Consult with stakeholders to deeply understand infrastructure, data, and compute needs, identifying potential solutions to support frontier research and product development
  • Set technical strategy and oversee development of high-scale, reliable infrastructure systems
  • Mentor top technical talent
  • Design processes (e.g. postmortem review, incident response, on-call rotations) that help the team operate effectively and never fail the same way twice

Benefits

  • Competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and a lovely office space in which to collaborate with colleagues