About The Position

Databricks is looking for a Staff Technical Program Manager to drive GenAI Operations and Capacity Planning for our large-scale LLM and GPU-backed platform. This role is designed for a senior, hands-on TPM who thrives in technically deep, data-driven environments and enjoys owning complex operational programs end to end. As a Staff TPM, you will own execution for critical GenAI operational initiatives, operate with significant autonomy, and partner closely with AI/ML engineering, infrastructure, finance, partner operations, and cloud/LLM providers. You will use strong analytical skills to guide decisions, surface risks, and continuously improve how Databricks launches, scales, and governs GenAI workloads. You will report to a Technical Program Leader and operate across multiple time zones in a fast-moving, highly ambiguous environment.

Requirements

  • 10+ years of overall industry experience, including 7+ years in Technical Program Management.
  • Experience leading cross-functional GenAI, AI/ML, or infrastructure programs from planning through launch and steady-state operations.
  • Strong background in capacity planning, forecasting, and infrastructure analytics.
  • Advanced SQL skills and hands-on experience building analytics, dashboards, and operational reporting.
  • Ability to translate complex data into clear insights and recommendations for engineering and leadership stakeholders.
  • Hands-on experience with at least one major cloud provider: AWS, Azure, or GCP.
  • Familiarity with agile methodologies and program management tools such as Jira.
  • Comfortable managing ambiguity, driving execution, and handling escalations when needed.

Nice To Haves

  • Master’s degree or advanced technical degree.
  • Experience operating LLM, GPU, or GenAI platforms in production environments.
  • Background in cloud infrastructure, distributed systems, or platform engineering.
  • Previous software or hardware development experience.

Responsibilities

  • Plan and execute day-0 launches of new LLM models on Databricks, ensuring production readiness across engineering, commercialization, go-to-market, legal, and cloud service partners.
  • Partner with AI/ML and platform engineering teams to operationalize LLM onboarding, rollout, and lifecycle management.
  • Define and maintain launch checklists, operational runbooks, and success metrics for GenAI workloads.
  • Own GPU and LLM capacity planning, forecasting, and allocation for GenAI workloads.
  • Build and maintain SQL-driven analytical models and dashboards to forecast demand, track utilization, and surface capacity risks.
  • Balance customer demand, growth trajectories, and contractual commitments to inform short- and medium-term capacity decisions.
  • Track and drive efficient consumption of GPU and LLM capacity, identifying underutilization, contention, and inefficiencies.
  • Define and monitor KPIs for utilization, efficiency, and reliability of GenAI platforms.
  • Use data to recommend improvements to engineering roadmaps, operational processes, and cost optimization efforts.
  • Execute governance mechanisms to ensure GenAI capacity usage aligns with contractual, financial, and compliance requirements.
  • Produce clear, data-backed reporting for senior leaders on capacity health, utilization trends, and operational risks.
  • Generate consumption reports, usage-metrics reporting, and share-of-wallet attestations.
  • Ensure documentation, controls, and processes are audit-ready and consistently followed.

Benefits

  • At Databricks, we strive to provide comprehensive benefits and perks that meet the needs of all our employees. For specific details on the benefits offered in your region, please visit https://www.mybenefitsnow.com/databricks.


What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Number of Employees: 5,001-10,000

© 2024 Teal Labs, Inc