About The Position

We are seeking a Systems Development Engineer to own the research compute platform for Fauna Robotics. You will build and operate the physical and virtual infrastructure that our ML scientists use to train reinforcement learning policies for real robots, from fleet provisioning and job scheduling to cloud burst capacity and environment reproducibility. This role requires both strong systems engineering fundamentals and genuine comfort working alongside researchers. The ideal candidate is as happy diagnosing a GPU thermal fault as they are designing a job scheduler, and treats “the scientist’s training run just works” as the north star for everything they build.

Requirements

  • 3+ years of Linux systems administration experience
  • 3+ years of non-internship professional systems engineering or systems development experience
  • Experience with configuration management and fleet automation (Ansible, Chef, or equivalent)
  • Experience with containerization in production (Docker required; Kubernetes or containerd exposure preferred)
  • Proficiency in Python, Go, or Bash for systems tooling and automation
  • Experience with NVIDIA GPU infrastructure: driver management, CUDA versioning, basic GPU diagnostics
  • Experience with job schedulers or orchestrators (Slurm, Ray, SkyPilot, Kubernetes with GPU operator, or equivalent)
  • Hardware comfort: diagnosing and replacing GPUs, PSUs, memory, storage

Nice To Haves

  • NVIDIA deep fluency: DCGM, NVLink / PCIe topology, IOMMU, compute mode configuration
  • Experience with GPU cloud providers (AWS p5/g6e, RunPod, Lambda, CoreWeave) for hybrid on-prem/cloud workflows
  • Track record of building internal platforms that accelerate other engineers or scientists

Responsibilities

  • Own on-prem GPU compute end-to-end: provisioning, imaging, driver and CUDA management, monitoring, failure diagnosis, hardware RMA, and capacity planning
  • Build and operate a job scheduling layer (Slurm, Ray, SkyPilot, or equivalent) so scientists submit training runs without managing individual machines
  • Design and implement the bridge between on-prem and cloud compute
  • Partner directly with ML scientists to triage training issues, profile workloads, identify bottlenecks, and advise on how to structure training for the hardware at hand

Benefits

  • Sign-on payments
  • Restricted stock units (RSUs)
  • Health insurance (medical, dental, vision, prescription; basic life & AD&D insurance with optional supplemental life plans; EAP; mental health support; medical advice line; flexible spending accounts; adoption and surrogacy reimbursement coverage)
  • 401(k) matching
  • Paid time off
  • Parental leave