Staff Technical Program Manager, Managed Intelligence

Crusoe • Sunnyvale, CA

5d • $193,050 - $234,000

About The Position

Crusoe is the world's first vertically integrated, sustainable AI cloud. We build and operate GPU infrastructure powered by clean energy, from data center design through IaaS products to managed inference at scale, enabling AI-native companies to run demanding workloads without compromising on sustainability or reliability. Crusoe Cloud is 1,400 people and growing, and the TPM frameworks are still being built -- which means there is a genuine opportunity to shape how the function operates rather than inherit how it already works.

The Managed Inference platform is where customers run production LLM workloads without managing low-level infrastructure, and it is one of Crusoe's fastest-growing product areas. The Staff TPM for Managed Intelligence connects model engineering, IaaS, product, and data center operations to deliver a reliable, scalable inference platform.

You will own end-to-end program delivery across multi-quarter roadmaps, model onboarding, inference optimization, and production readiness for new model versions. Deep familiarity with the model layer -- including how LLMs are served, optimized, and evaluated in production -- is essential to being effective in this role.

Requirements

7+ years of experience as a Technical Program Manager in fast-paced technical environments, with a track record of owning complex programs end-to-end across engineering and product organizations.
LLM inference and model serving knowledge: Working familiarity with batching strategies, quantization approaches, and the tradeoffs that govern latency, throughput, and cost at production scale.
Multi-tenant systems experience: Familiarity with isolation, quota management, and SLA enforcement across concurrent workloads.
Fine-tuning and alignment awareness: Sufficient familiarity with fine-tuning and alignment workflows to govern program timelines, identify technical risks, and coordinate across the teams that own them.
Low-structure execution: Proven ability to build execution models in environments where the process did not yet exist, and make them stick with teams that didn't ask for them.
Executive communication: Exceptional written and verbal communication for delivering clear, data-driven, decision-oriented updates to executive stakeholders.
AI tool integration: Active, daily use of AI tools to improve program execution, risk detection, and communication -- not just personal productivity.
Cross-functional influence: Proven ability to drive alignment across engineering, product, and infrastructure leadership without direct authority, including with highly technical stakeholders.

Nice To Haves

1+ years of experience working with teams building platforms or services for AI inference and/or training.
Direct experience governing model onboarding programs across GPU generations, including firmware, driver, and stack validation.
Experience coaching or mentoring junior TPMs in a high-growth technical environment.
Exposure to multi-site or globally distributed engineering teams.
Background at a Series D to Series F company or a high-performing team within a hyperscaler focused on AI infrastructure.

Responsibilities

End-to-end program delivery: Own multi-quarter release planning, dependency governance, and executive communication across the Managed Inference platform.
Complex, high-risk program management: Drive model version rollouts, inference optimization campaigns, SLA readiness for new GPU hardware, and multi-tenant capacity planning from kickoff through delivery.
Cross-functional alignment: Coordinate across Model Engineering, IaaS, Cloud Foundations, Data Center Operations, and external model providers to keep programs on track and unblocked.
Proactive risk identification: Surface risks across model serving, reliability, capacity constraints, and vendor timelines before they become program-level problems.
Execution frameworks and dashboards: Build lightweight, scalable TPM frameworks suited to Crusoe's pace; maintain real-time execution dashboards and deliver crisp, data-driven executive updates.
Phase 0 planning for model onboarding: Own pre-launch planning for model onboarding on new GPU generations, including firmware and driver readiness, CUDA and ROCm stack validation, and commissioning criteria for inference workloads.
Stakeholder leadership: Drive alignment and push back effectively across engineering, product, and operations leadership -- including highly technical stakeholders who have not previously worked with a TPM.