Senior Technical Product Manager, Fleet Operations

Nscale

$200,000 - $280,000

About The Position

Nscale is building a vertically integrated GenAI cloud platform, focusing on sustainable technology solutions. As a Senior Technical Product Manager for Fleet Operations, you will own the product strategy for the operational software that manages Nscale's global GPU fleet. This includes systems for bringing capacity online, maintaining its health, and rapidly restoring it when issues arise.

You will collaborate closely with engineering, design, research, and go-to-market teams to translate customer problems and operational realities into successful product outcomes. Your role involves partnering daily with Fleet Software engineering teams, SRE, and Support to develop durable products that address operational challenges in provisioning, testing, deployment, monitoring, incident response, repair, and decommissioning.

You will operate at the team scope, taking ownership of a major product area and driving initiatives that span multiple quarters. Your work will directly impact fleet availability, utilization, and time-to-recover. This role is ideal for someone passionate about owning the software that keeps a global GPU fleet running and improving the team's overall performance.

Requirements

5–8 years of product management experience in software or technology.
Track record of owning significant product areas in infrastructure, platform, or operations-facing products.
Strong technical fluency in large-scale systems, capable of leading discussions with engineering on architecture, trade-offs, and feasibility across provisioning, orchestration, observability, and control-plane design.
Experience building products for operators (SREs, NOC/support teams, data centre technicians, or similar) with a genuine appetite for understanding their workflows.
Demonstrated ability to move from ambiguous operational problems to shipped product outcomes that measurably improve reliability, efficiency, or time-to-recover.
Experience mentoring or informally leading peers.
Excellent written and verbal communication skills, able to make complex product decisions clear to engineers, operators, and executives.

Nice To Haves

Degree in computer science, engineering, or a related field, or prior experience as an engineer or SRE.
Hands-on background in cloud infrastructure, bare-metal provisioning, fleet or hardware lifecycle management, observability/monitoring platforms, or incident management tooling.
Experience with bare-metal provisioning systems (e.g., OpenStack Ironic, MAAS, Tinkerbell, or in-house stacks).
Experience with DCIM tools (e.g., NetBox, Device42, Nautobot) for inventory, cabling, and rack/asset management.
Experience with ITSM and ticketing platforms (e.g., Jira Service Management, ServiceNow, Zendesk, Freshservice) for support, incident, and RMA workflows.
Experience with observability and monitoring platforms (e.g., Grafana, Prometheus, Datadog), including defining SLOs, dashboards, and alerting for large fleets.
Familiarity with GPU or accelerated compute environments, data centre operations, or hyperscaler-style fleet management.
Experience operating in high-growth or early-stage environments where the product is being built alongside the fleet itself.

Responsibilities

Own the strategy and roadmap for a significant Fleet Operations product area (e.g., provisioning and bring-up, fleet health and telemetry, incident and repair workflows, firmware and lifecycle management, or capacity and inventory).
Lead multi-sprint, cross-functional initiatives from problem framing through rollout across live GPU clusters, working hand-in-hand with Fleet Software, SRE, data centre operations, and Support.
Turn ambiguous operational problems into product by shadowing on-call rotations, sitting in on support and repair workflows, and converting recurring toil into tooling, automation, and platform capabilities.
Define key metrics for a GPU fleet (availability, utilization, MTTR, time-to-bring-up, hardware failure rates, support ticket deflection) and drive the roadmap against them.
Partner with engineering on architecture and trade-offs for systems spanning bare metal, orchestration, observability, and control planes.
Turn findings from incident reviews and postmortems into product commitments that prevent similar issues from recurring.
Mentor junior product managers and elevate the quality of PRDs, reviews, and product decisions across the team.
Represent Fleet Operations in planning, reviews, and leadership updates.