Production Engineer, Network

Fluidstack•London, London

3d•$175,000 - $300,000•Remote

About The Position

Fluidstack is building civilization-scale infrastructure for AI, aiming to deliver 10 to 100s of gigawatts (GW) of compute faster than anyone else. This involves rethinking every layer of the stack, from acquiring power and designing datacenters to operating them. The company emphasizes high ownership, autonomy, velocity, first-principles thinking, and a passion for the frontier of AI. The Production Engineering Team is focused on building active debugging tooling, an end-to-end network repair pipeline, and a real-time network monitoring platform for a growing fleet across multiple hyperscale datacenter sites.

Requirements

Treat toil as a bug; build tools to automate manual processes like diagnosing link failures.
Think in systems and understand how network issues propagate.
Move toward ambiguity, build maps, and explain them.
Learn at a steep slope and reach competence in unfamiliar domains quickly.
Run incidents, write postmortems, and fix systemic causes.
Be fluent with AI tooling, including LLM APIs, MCP servers, and agentic frameworks.
Have shipped production network tooling or automation that other teams depend on.
Be comfortable working in any programming language with the help of AI coding tools.

Nice To Haves

Network automation and tooling (gNMI, gRPC, NETCONF, SONiC).
Link diagnostics or optical network monitoring.
RMA and repair lifecycle automation.
Large-scale datacenter fabric (BGP, ECMP, spine-leaf).
Out-of-band network management.
Experience with Go or Python.

Responsibilities

Own network fleet health end to end, including defining real-time monitoring requirements, building the alerting lifecycle, and shipping dashboards for network state across all sites.
Build active debugging tooling, such as link diagnostics, remote command execution across the fleet, and repair visualization, to quickly resolve network faults.
Develop automation for the network repair pipeline, from fault detection through parts management and return to service, including ticket integration and lifecycle pipelines.
Own network qualification and validation by building frameworks that gate new sites and hardware into production, defining healthy network criteria before traffic is carried.
Ensure end-to-end reliability, scalability, and operation of the network at scale through automation, tooling, and incident discipline.