About The Position

Databricks is looking for a Staff Technical Program Manager to drive GenAI Operations and Capacity Planning for our large-scale LLM and GPU-backed platform. This role is designed for a senior, hands-on TPM who thrives in technically deep, data-driven environments and enjoys owning complex operational programs end to end. As a Staff TPM, you will own execution for critical GenAI operational initiatives, operate with significant autonomy, and partner closely with AI/ML engineering, infrastructure, finance, partner operations, and cloud/LLM providers. You will use strong analytical skills to guide decisions, surface risks, and continuously improve how Databricks launches, scales, and governs GenAI workloads. You will report to a Technical Program Leader and operate across multiple time zones in a fast-moving, highly ambiguous environment.

Requirements

  • 10+ years of overall industry experience, including 7+ years in Technical Program Management.
  • Experience leading cross-functional GenAI, AI/ML, or infrastructure programs from planning through launch and steady-state operations.
  • Strong background in capacity planning, forecasting, and infrastructure analytics.
  • Advanced SQL skills and hands-on experience building analytics, dashboards, and operational reporting.
  • Ability to translate complex data into clear insights and recommendations for engineering and leadership stakeholders.
  • Hands-on experience with at least one major cloud provider: AWS, Azure, or GCP.
  • Familiarity with agile methodologies and program management tools such as Jira.
  • Comfortable managing ambiguity, driving execution, and handling escalations when needed.

Nice To Haves

  • Master’s degree or advanced technical degree.
  • Experience operating LLM, GPU, or GenAI platforms in production environments.
  • Background in cloud infrastructure, distributed systems, or platform engineering.
  • Previous software or hardware development experience.

Responsibilities

  • Plan and execute day-0 launches of new LLM models on Databricks, ensuring production readiness across engineering, commercialization, go-to-market, legal, and cloud service partners.
  • Partner with AI/ML and platform engineering teams to operationalize LLM onboarding, rollout, and lifecycle management.
  • Define and maintain launch checklists, operational runbooks, and success metrics for GenAI workloads.
  • Own GPU and LLM capacity planning, forecasting, and allocation for GenAI workloads.
  • Build and maintain SQL-driven analytical models and dashboards to forecast demand, track utilization, and surface capacity risks.
  • Balance customer demand, growth trajectories, and contractual commitments to inform short- and medium-term capacity decisions.
  • Track and drive efficient consumption of GPU and LLM capacity, identifying underutilization, contention, and inefficiencies.
  • Define and monitor KPIs for utilization, efficiency, and reliability of GenAI platforms.
  • Use data to recommend improvements to engineering roadmaps, operational processes, and cost optimization efforts.
  • Execute governance mechanisms to ensure GenAI capacity usage aligns with contractual, financial, and compliance requirements.
  • Produce clear, data-backed reporting for senior leaders on capacity health, utilization trends, and operational risks.
  • Generate consumption reports, usage-metrics reporting, and share-of-wallet attestations.
  • Ensure documentation, controls, and processes are audit-ready and consistently followed.

Benefits

  • At Databricks, we strive to provide comprehensive benefits and perks that meet the needs of all our employees. For specific details on the benefits offered in your region, please visit https://www.mybenefitsnow.com/databricks.


What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Number of Employees: 5,001-10,000

© 2024 Teal Labs, Inc