Staff AI/ML Infrastructure Engineer

Vultr
$145,000 - $160,000

About The Position

Vultr is on a mission to make high-performance cloud infrastructure easy to use, affordable, and locally accessible for enterprises and AI innovators around the world. With 32 global cloud data center locations, Vultr is trusted by hundreds of thousands of active customers across 185 countries for its flexible, scalable, global Cloud Compute, Cloud GPU, Bare Metal, and Cloud Storage solutions. In December 2024, Vultr announced an equity financing at a $3.5 billion valuation. Founded by David Aninowsky and self-funded for over a decade, Vultr has grown to become the world’s largest privately held cloud infrastructure company.

Vultr is seeking a highly skilled and experienced Staff AI/ML Infrastructure Engineer to drive the design, performance, and reliability of our AI infrastructure platform. The ideal candidate is a hands-on infrastructure expert with deep GPU systems knowledge, strong automation experience, and a track record of technical leadership in high-performance environments. This is a highly visible role in a high-growth technology company, requiring ownership of complex hardware and software systems, collaboration across engineering and vendor partners, and a relentless focus on operational excellence. This is your opportunity to build the foundation powering next-generation AI workloads and leave a lasting mark on Vultr and the future of cloud infrastructure.

Requirements

  • 5+ years of experience working with bare metal infrastructure and hardware automation
  • Hands-on experience with modern NVIDIA/AMD GPU platforms and high-performance networking (RoCE, InfiniBand)
  • Deep knowledge of BIOS, BMC, firmware, NICs, Redfish/IPMI, and PCIe systems
  • Strong Linux systems experience including device drivers and package management
  • Experience building infrastructure automation using Python and Bash
  • Familiarity with GPU drivers, firmware ecosystems, and vendor collaboration
  • Experience designing and delivering complex infrastructure products
  • Proven ability to lead projects and mentor engineers
  • Experience optimizing multi-cluster GPU environments
  • Exposure to Machine Learning software stacks and GPU workloads

Responsibilities

  • Design and maintain GPU and bare metal infrastructure in containerized and physical environments
  • Build scalable GPU clusters in partnership with networking and provisioning teams
  • Ensure reliable, high-performance provisioning of GPU infrastructure
  • Develop automated testing systems for GPU-based platforms
  • Implement infrastructure solutions for diverse AI/ML workloads
  • Benchmark, test, and troubleshoot GPU performance at scale
  • Collaborate with hardware vendors on drivers, firmware, and support
  • Resolve hardware, software, and performance issues across environments
  • Optimize rail and cluster performance across architectures
  • Lead technical direction and mentor engineers on infrastructure best practices

Benefits

  • 100% company-paid insurance premiums for employee medical, dental, and vision plans
  • 401(k) plan that matches 100% up to 4%, with immediate vesting
  • Professional Development Reimbursement of $2,500 each year
  • 11 Holidays + Paid Time Off Accrual + Rollover Plan
  • Commitment matters to Vultr! Increased PTO at 3-year and 10-year anniversaries + 1-month paid sabbatical every 5 years + Anniversary Bonus each year
  • $500 stipend for remote office setup in first year + $400 each following year
  • Internet reimbursement up to $75 per month
  • Gym membership reimbursement up to $50 per month
  • Company-paid Wellable subscription