Senior / Staff Network Reliability Engineer

FluidStack, San Francisco, CA

About The Position

Our Network Reliability Engineers (NREs) are the backbone of Fluidstack's platform. You'll apply deep networking expertise and software engineering skills to keep our high-performance network fabrics fast, reliable, and cost-efficient at scale. Our NREs operate RDMA fabrics, the datacenter network, and our WAN backbones.

Requirements

  • 7+ years in network-heavy SRE, performance engineering, or data-center networking.
  • Mastery of the Linux networking stack and protocol-level debugging (TCP, InfiniBand, RoCE).
  • Production experience with multiple vendors (Mellanox/NVIDIA, Arista, Juniper, etc.), multi-layer fabrics, and network overlays (VXLAN, Geneve).
  • Fluency in Python, Go, or Rust; solid Infrastructure-as-Code and CI/CD skills.
  • Familiarity with DPDK, XDP, eBPF and InfiniBand/RoCE.
  • Proven track record scaling low-latency, high-throughput networks for AI/ML or HPC clusters.

Responsibilities

  • Super-charge the network stack. Tune TCP/IP, RDMA (primarily RoCE congestion control), kernel-bypass frameworks (DPDK, XDP, eBPF) and NIC offloads to squeeze microseconds off packet latency for AI & HPC workloads.
  • Deploy & optimize at scale. Roll out new ToR/spine switches (from NVIDIA, Arista, Juniper, and others), validate SmartNIC and BlueField networking, configure BGP/EVPN fabrics, and optimize flow control (PFC, ECN) for zero-loss transport.
  • Automate observability. Build NIC-to-orchestrator telemetry pipelines, packet-loss detection bots, and real-time throughput/latency dashboards.
  • Root-cause the gnarly stuff. Lead packet captures, congestion analyses, and latency-regression investigations; turn insights into switch firmware patches, kernel tuning, and topology optimizations.
  • Drive vendor collaboration. Pair with networking vendors to debug hardware, accelerate RDMA paths, validate optics, and integrate emerging network hardware (800G/1.6T, LPO/CPO).
  • Continuously improve. Inject link failures, run game days that simulate network partitions, and codify post-mortem learnings into SLIs/SLOs that matter to customers.

Benefits

  • Competitive total compensation package (cash + equity).
  • Retirement or pension plan, in line with local norms.
  • Health, dental, and vision insurance.
  • Generous PTO policy, in line with local norms.