Technical Program Manager, RL Infrastructure & Reliability

DeepMind
Mountain View, CA
$156,000 - $229,000

About The Position

As a Technical Program Manager for Reinforcement Learning (RL) Infrastructure & Reliability, you will focus on a critical, rapidly evolving area: the post-training stack that refines and improves Gemini models. You will be a hands-on driver of technical programs, embedding with engineering teams to enhance the reliability, performance, and scalability of the infrastructure that powers our most advanced RL workloads. This role is for a TPM who thrives on ambiguity and technical depth. You will lead concrete engineering initiatives, from driving performance optimization programs to owning the execution of reliability roadmaps. Your work will have a direct and measurable impact on the quality of our models and the velocity of our research.

Requirements

  • Bachelor's degree in a technical field or equivalent practical experience.
  • 5 years of experience in program or project management in a technical software environment.
  • Experience working directly with engineering teams on the software development lifecycle.

Nice To Haves

  • 5+ years of relevant work experience.
  • Experience with machine learning workflows, particularly in training, post-training, or MLOps. Direct experience with Reinforcement Learning (RL) is a strong plus.
  • Strong analytical skills, with experience in performance analysis, reliability engineering (SRE), or technical efficiency projects.
  • Proficiency with project management and development tools (e.g., Jira, Gantt charts) for managing technical backlogs.
  • Excellent interpersonal and communication skills, with a demonstrated ability to work effectively in ambiguous, fast-paced R&D environments.

Responsibilities

  • Performance & Efficiency Optimization: Drive technical programs focused on optimizing the performance and efficiency of post-training and RL workloads. This includes quantitative analysis, developing shared dashboards, and guiding engineering execution on improvements.
  • Reliability Roadmap Execution: Execute key projects from the post-training reliability roadmap, such as improving monitoring tools and centralizing core services, to enhance the stability of the entire stack.
  • Code Health Initiatives: Own the technical project management for initiatives aimed at improving the long-term health, testability, and maintainability of the RL infrastructure codebases.
  • Roadmap & Backlog Management: Manage the engineering backlog and tactical execution for core RL framework development, ensuring progress is tracked and aligned with the team's strategic roadmap.
  • Cross-Functional Coordination: Build effective working relationships with engineering teams, guiding alignment on project goals, managing interdependencies, and ensuring clear communication and risk management.
  • Program Governance: Contribute to the broader program management of the Frameworks and Infrastructure team, providing clear stakeholder updates and supporting team-wide events.

Benefits

  • Bonus
  • Equity
  • Benefits