Infrastructure Engineer, ML Systems

Applied Compute
San Francisco, CA
Onsite

About The Position

Applied Compute builds Specific Intelligence for enterprises, unlocking the knowledge inside a company to train custom models and deploy an in-house agent workforce. Today’s state-of-the-art AI isn’t one-size-fits-all; it’s a tailored system that continuously learns from a company’s processes, data, expertise, and goals. Just as companies compete today by having the best human workforce, the companies building for the future will compete by having the best agent workforce supporting their human bosses. We call this Specific Intelligence, and we’re already building it today.

We are a small, talent-dense team of engineers, researchers, and operators who have built some of the most influential AI systems in the world, including reinforcement learning infrastructure at OpenAI and data foundations at Scale AI, with additional experience from Together, Two Sigma, and Watershed. We’re backed with $80M from Benchmark, Sequoia, Lux, Hanabi, Neo, Elad Gil, Victor Lazarte, Omri Casspi, and others. We work in-person in San Francisco.

The Role

As a founding Infrastructure Engineer, ML Systems, you’ll be responsible for designing, implementing, and optimizing large-scale machine learning systems that power both customer deployments and frontier reinforcement learning research. Frontier systems are exciting yet brittle, and require diligence and attention to detail to engineer correctly. We value performance with correctness: each on its own is necessary but not sufficient to train frontier models effectively. You’ll work closely with our researchers and product engineers to bring frontier LLM post-training software into enterprise deployments. This role is perfect for systems enthusiasts who thrive on implementing high-performance, reliable systems at scale.

Requirements

  • Fearlessness and curiosity to understand all levels of the training system
  • Uncompromising desire to learn and keep up with frontier techniques
  • Background in programming on, and managing training jobs across, large-scale GPU systems
  • Bias toward fast implementation, paired with a high bar for reliability and efficiency
  • Experience with open-weights models (architecture and inference)
  • Background in reinforcement learning or integration of inference with RL training loops
  • Demonstrated technical creativity through published projects, OSS contributions, or side projects

Responsibilities

  • Design and optimize a frontier LLM post-training stack, including our training and inference pipelines
  • Implement and debug systems with an eye toward how they affect ML outcomes (e.g., low-precision numerics)
  • Design tooling and observability to allow researchers and customers to inspect and profile our large training systems

Benefits

  • Generous health benefits
  • Unlimited PTO
  • Paid parental leave
  • Lunches and dinners at the office
  • Relocation support as needed