Production Engineer, Compute

Fluidstack•San Francisco, CA

2d•$175,000 - $300,000

About The Position

Fluidstack is building civilization-scale infrastructure for AI, focusing on delivering massive amounts of compute power faster than anyone else. We are rethinking every layer of the stack, from acquiring power to designing, building, and operating data centers with teams spanning hardware and software. Speed and scale are our key differentiators. We are looking for individuals who care deeply about this problem space and are motivated to contribute to building this infrastructure. Our operating principles include high ownership with full autonomy, driving everything forward with velocity, challenging assumptions with first principles thinking, and a passion for the frontier of AI. We operate at high intensity to push the frontier forward.

Requirements

Treat toil as a bug; view manual steps in repair workflows as backlog items, not part of the job description.
Possess an instinct for hardware and be comfortable reasoning about failure modes at the firmware and silicon level.
Move towards ambiguity, build maps in uncharted territory, and explain them clearly.
Learn at a steep slope, achieving competence in unfamiliar domains quickly.
Carry a pager without flinching, run incidents, write postmortems, and fix systemic causes.
Be fluent with AI tooling such as LLM APIs, MCP servers, and agentic frameworks, and proficient with AI coding tools like Claude Code or Cursor.
Have experience shipping production automation that other teams depend on.
Be comfortable working in any programming language when using AI coding tools.

Nice To Haves

Hardware lifecycle management and RMA automation.
BMC/Redfish or IPMI tooling.
GPU/TPU qualification or burn-in frameworks.
Workflow and orchestration engines (e.g., Temporal, Cadence).
Metrics and alerting pipelines (e.g., Prometheus, Grafana).
Experience with Go or Python.

Responsibilities

Own compute fleet health end to end, including building metrics pipelines, alerting, and a unified health view for GPUs and TPUs across Kubernetes-orchestrated workloads and bare metal at scale.
Transform repair from a manual procedure into an automated pipeline, managing the process from failure detection through triage, parts management, and return to service.
Design and expand the XPU qualification platform, defining standards for burn-in, performance baselining, and NPI execution for new GPU and TPU generations before they are used for customer workloads.
Own Redfish and BMC tooling, including firmware-level telemetry, fleet-scale log collection, and the low-level access layer essential for repair automation and health tooling.
Ensure the end-to-end reliability, scalability, and operation of the compute fleet, leveraging aggressive automation, tooling, and incident discipline to manage one of the world's largest XPU fleets.