Director, Infrastructure

Fluidstack
$250,000 - $350,000

About The Position

At Fluidstack, we’re building the infrastructure for abundant intelligence. We partner with top AI labs, governments, and enterprises - including Mistral, Poolside, Black Forest Labs, Meta, and more - to unlock compute at the speed of light. We’re working with urgency to make AGI a reality. As such, our team is highly motivated and committed to delivering world-class infrastructure. We treat our customers’ outcomes as our own, taking pride in the systems we build and the trust we earn. If you’re motivated by purpose, obsessed with excellence, and ready to work very hard to accelerate the future of intelligence, join us in building what’s next.

About The Role

Fluidstack is hiring a Director of Infrastructure to own the hardware that powers some of the largest AI clusters in the world. You will lead a team of Networking Engineers, Compute Systems Engineers, and Storage Engineers, and coordinate tightly with Procurement, DC Operations, Software Engineering, SRE, Finance, Security, and Sales to ensure Fluidstack can deliver clusters faster and operate them more reliably than anyone else in the world. You are expected to be exceptional at both ends of the communication spectrum: technically precise with engineering stakeholders, and credible with customers, partners, and executive stakeholders. You have personally shipped a 10,000+ GPU cluster using current-generation hardware. You know what it takes to bring one up in weeks rather than months, and you have built the tooling, runbooks, and team culture to do it repeatedly.

Requirements

  • 10+ years of infrastructure engineering experience, with at least 3 years in a technical leadership role managing a team of systems, networking, or storage engineers.
  • Demonstrated ownership of the design, deployment, and operation of a 10,000+ GPU cluster using a recent-generation accelerator (Blackwell, Hopper, or equivalent XPU), from physical hardware bring-up through production steady-state.
  • On-site, hands-on experience physically deploying hardware in data centers, with a clear sense of what it takes to execute a fast, reliable cluster bring-up.
  • Deep expertise in high-performance networking for AI workloads: InfiniBand (XDR/NDR) or RoCEv2 fabric design, large-scale BGP and ECMP architectures, and switch and cable plant management.
  • Strong working knowledge of GPU server hardware internals: NVLink and PCIe topology, NVMe configurations, BMC and firmware management.
  • Experience with high-performance parallel and distributed storage systems for AI training workloads, such as DDN/Lustre, WekaFS, VAST, and open source solutions.
  • Exceptional written and verbal communication skills, with the ability to translate between deep technical detail and high-level summaries for engineering, executive, and customer audiences.

Nice To Haves

  • Prior experience at a hyperscaler, neocloud, or GPU OEM in a senior infrastructure or systems engineering role.
  • Experience building and operating bare-metal management tooling and standards such as MAAS, NetBox, and Redfish, including automation of imaging, firmware updates, and hardware lifecycle workflows.
  • Hands-on experience with GPU NPI processes: hardware qualification, acceptance testing, burn-in procedures, and vendor escalation for platform-level defects at cluster scale.
  • Familiarity with current-generation networking products (InfiniBand, RoCE) and the systems-level tradeoffs between them for large-scale AI training and inference.
  • Experience with data center physical infrastructure tradeoffs relevant to GPU-dense deployments: direct liquid cooling, rear-door heat exchangers, high-density PDU and busway configurations, and their impact on cluster layout and availability.
  • An understanding of the software running on these clusters, including Kubernetes, SLURM, PyTorch, and JAX, sufficient to reason about how infrastructure decisions affect workload performance and reliability.
  • Experience representing infrastructure capabilities in customer-facing or commercial contexts, including pre-sales technical diligence with enterprise or government customers.

Responsibilities

  • Own the technical design, deployment, and operational reliability of Fluidstack's bare-metal clusters across all production sites, covering compute, storage, and networking infrastructure.
  • Lead the Infrastructure Engineering organization, comprising Networking Engineers, Compute Systems Engineers, and Storage Engineers, with high standards for technical depth, deployment velocity, and on-call reliability.
  • Drive cluster architecture decisions for current-generation GPU systems (NVIDIA, AMD, and other XPUs), including server configuration, frontend and backend fabric design, storage topology, and rack power and cooling envelope.
  • Coordinate with Supply Chain on OEM relationships, hardware specifications, and delivery timelines to ensure the physical infrastructure roadmap stays one step ahead of customer commitments.
  • Partner with Data Center Operations on new site bring-ups, ensuring smooth handoff from civil and MEP completion through network cabling, hardware racking, burn-in, and customer acceptance testing.
  • Work with Software Engineering and SRE to define infrastructure requirements for managed Kubernetes, SLURM, and inference serving, ensuring the physical layer meets the demands of the software stack.
  • Build and maintain deployment tooling, burn-in automation, and hardware lifecycle management systems that enable your team to operate at a pace and reliability level that sets Fluidstack apart.
  • Stay hands-on: participate in design reviews, be present for critical cluster bring-ups, and engage directly with complex infrastructure failures to maintain technical credibility with your team and across the organization.
  • Travel as needed to data centers, OEM facilities, customer sites, and industry events to stay close to the hardware, the partners, and the market.
  • Coordinate with Finance on infrastructure CapEx planning and cost modeling, with Security on hardening and compliance requirements, and with Sales on pre-sales technical diligence and capacity commitments to customers.

Benefits

  • Competitive total compensation package (salary + equity).
  • Retirement or pension plan, in line with local norms.
  • Health, dental, and vision insurance.
  • Generous PTO policy, in line with local norms.