Principal Operations Engineer, Hardware — Data Center Operations

Fluidstack

$150,000 - $250,000 • Onsite

About The Position

We are seeking a Principal Operations Engineer, Hardware to serve as the most senior technical authority for the operational hardware fleet across our hyperscale AI data center portfolio. AI infrastructure lives and dies on the reliability of the compute itself. This role exists to ensure that the GPU systems, servers, and supporting hardware we deploy at scale are operated, maintained, and continuously improved at the standard the workload demands.

You will operate as the technical arm of senior operations leadership in the field: leading site assessments and operational audits, driving the technical readiness of teams ahead of site activation, reviewing hardware platforms and integration designs from an operational lens, and feeding operational learnings back into the hardware engineering, deployment, and supply chain organizations as we shift toward a productized, repeatable build model. You will be a force multiplier across our site hardware leads, deployment teams, and reliability engineers, and the connective tissue between hardware operations, hardware engineering, network, facilities, and customer-facing teams.

The ideal candidate has spent a career operating hardware at scale (in hyperscale data centers, large HPC environments, or comparable 24/7 infrastructure) and is equally comfortable diagnosing a stubborn boot failure on the floor, leading a fleet-wide root cause investigation, and pushing back on a vendor over a flawed RMA process. Formal engineering credentials are valued but not required. Practical depth, judgment under pressure, the ability to teach, and the discipline to keep critical infrastructure running through change are what define this role.

Requirements

10+ years of hands-on experience operating mission-critical hardware infrastructure, with at least 5 years as the senior technical voice on a site, campus, or fleet.
Deep working command of GPU systems, server platforms, storage infrastructure, firmware lifecycle management, and hardware diagnostics — earned in the field, not from a textbook.
Demonstrated ability to author, approve, and execute high-risk MOPs and change records in live production environments.
A track record of leading root cause analysis on significant hardware events and driving corrective actions to closure.
A track record of holding OEMs, ODMs, service vendors, and deployment partners accountable — you know how to enforce a standard without burning the relationship.
Strong written communication: operational health assessments, RCAs, procedure reviews, and design review feedback are second nature.
Comfort operating as the senior technical voice across operations, hardware engineering, network, facilities, supply chain, and customer-facing teams.

Nice To Haves

Bachelor's degree in Computer Engineering, Electrical Engineering, Computer Science, or related field.
Hyperscale or large-scale compute operational experience supporting thousands of servers and accelerator systems.
Direct experience operating modern GPU platforms at production scale.
Strong working knowledge of Linux administration, hardware management tooling, and production troubleshooting workflows.
Experience supporting liquid-cooled compute infrastructure and the operational practices required to maintain it.
Experience operating across multiple sites or as part of a global fleet operations function.
Experience standing up new sites from deployment handover through steady-state.
Experience contributing operational requirements into hardware platform decisions, reference architectures, or productized data center builds.
Scripting and automation experience in support of fleet-scale hardware operations.

Responsibilities

Data center operations experience strongly preferred; hyperscale, large HPC, cloud, or other mission-critical compute infrastructure experience considered.
Willingness to travel extensively across the fleet (50-75% of the time).