Senior Technical Program Manager

Crusoe | Sunnyvale, CA
$161,700 - $196,000

About The Position

Crusoe's Cloud Product team is hiring a Senior Technical Program Manager to own and drive technical programs across our AI IaaS platform, aligning Product and Engineering on delivery. This role requires a hands-on individual contributor with genuine technical depth in AI infrastructure, comfortable operating in ambiguous, fast-moving environments and capable of building execution structure where little exists.

Our vision is the easiest-to-use, purpose-built AI cloud. We offer IaaS products that let AI/ML engineers focus on model frameworks (PyTorch, Ray) and compute stacks (CUDA, ROCm) while Crusoe manages the underlying complexity. Our platform standardizes firmware/OS bundles and automates component orchestration for consistent, scalable infrastructure. We also offer AI Managed Services, such as SLA-bounded Managed Inference.

The TPM connects engineering, product, procurement, and data center operations to deliver a reliable platform where customers run AI workloads without managing low-level system details. At this level, TPM engagement focuses on defined technical programs and workstreams within larger cross-functional initiatives: GPU cluster commissioning, feature delivery within IaaS products, NPI workstream ownership, and coordination across hardware and software dependencies.

The ideal candidate is engineer-rooted, with hands-on technical depth in compute infrastructure, firmware, or hardware platform delivery. You will own defined programs end to end, build lightweight execution frameworks, govern dependencies within your scope, and grow into broader program ownership over time.

Requirements

  • AI Infrastructure Technical Depth: Genuine hands-on knowledge of the AI infrastructure stack — GPU firmware, drivers, BIOS/BMC, CUDA or ROCm stacks, OS configurations, and the tooling required to bring a cluster to production. Ability to engage credibly with firmware, hardware, networking, and compute engineers without a translator.
  • Commissioning Fluency: Familiarity with GPU cluster commissioning — understanding of the distinct workstreams (Compute, SDN, ZTP, Firmware, Network, Storage), commissioning gate definitions, and what it means to take ownership of a cluster from RFN through GA.
  • Experience at Hyperscalers or Neoclouds: 5–10 years of experience as a Technical Program Manager, with meaningful tenure at a hyperscaler (AWS, Azure, Google, Meta, OCI) or neocloud (CoreWeave, Lambda Labs) in a hardware, firmware, compute platform, or IaaS delivery role. Must have been on the build side — not customer-facing or enterprise IT delivery.
  • AI Tool Integration: Active daily use of AI tools to enhance program execution, tracking, and communication.
  • Execution Rigor: Demonstrated ability to establish program predictability in ambiguous environments. Builds lightweight structure that engineering teams trust and adopt.
  • Communication: Clear written and verbal communication for delivering program status, risks, and dependencies to engineering leads and product managers.

Nice To Haves

  • Bachelor's or Master's degree in Engineering, Computer Science, or a related technical field
  • Hands-on engineering background — hardware engineering, systems engineering, silicon validation, embedded software, or firmware
  • Experience with GPU NPI programs, firmware deployment across compute fleets, or datacenter commissioning
  • Experience at a hyper-growth company

Responsibilities

  • Workstream & Program Ownership: Own delivery of defined feature sets or commissioning workstreams within larger IaaS and NPI programs. Drive execution from kickoff through GA or handoff milestone.
  • Hands-On Execution: Personally manage technically complex programs within your scope. Identify blockers early, intervene to correct course, and escalate cross-organizational issues promptly when they exceed your authority.
  • Commissioning Leadership: Own the Cloud TPM side of GPU cluster commissioning for assigned deployments — define RFN acceptance criteria, commissioning readiness requirements, and Go/No-Go criteria. Lead Phase 4.0 commissioning activities including network fabric, storage provisioning, GPU validation, burn-in, and monitoring integration.
  • Risk & Dependency Governance: Proactively identify technical and organizational risks within your programs. Maintain dependency logs, surface blockers before scheduled reviews, and drive issues to resolution.
  • Execution Frameworks: Build lightweight, fit-for-purpose execution structures for your programs — standups, dashboards, status updates, and risk logs. Establish predictability within your workstream.
  • External Partner Coordination: Coordinate with external dependencies such as hardware partners, ODMs, or data center operations teams on assigned programs. Support certification, validation, and infrastructure stack alignment under guidance of senior TPMs.

Benefits

  • Competitive compensation
  • Restricted Stock Units
  • Paid time off & paid holidays
  • Comprehensive health, dental & vision insurance
  • Employer contributions to HSA account
  • Paid parental leave
  • Paid life insurance, short-term and long-term disability
  • Professional development & tuition reimbursement
  • Mental health & wellness support
  • Commuter benefits (parking & transit)
  • Cell phone stipend
  • 401(k) Retirement plan with company match up to 4% of salary
  • Volunteer time off