Senior Site Reliability Engineer, AI Factory

Jobgether
2d · $176,000–$333,500

About The Position

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Site Reliability Engineer, AI Factory in the United States. This role focuses on designing, operating, and optimizing next-generation GPU-accelerated data centers at scale, ensuring performance, reliability, and efficiency for AI workloads. You will lead the end-to-end lifecycle of critical infrastructure, from provisioning and commissioning to day-to-day operations, while collaborating across hardware, software, and operational teams. Success in this position requires deep technical expertise, hands-on problem solving, and a passion for open-source solutions and automation. You will help define operational standards for large-scale AI facilities, drive continuous improvement, and implement processes that maintain uptime while enabling cutting-edge innovation. This role offers the opportunity to impact global AI infrastructure and work in a high-performance, collaborative environment with engineers tackling unique telemetry, orchestration, and reliability challenges.

Requirements

  • Bachelor’s or Master’s degree in Computer Engineering, Computer Science, or a related field, or equivalent experience.
  • 10+ years of experience in data center operations, site reliability, or critical infrastructure management.
  • Proven experience managing GPU fleets and large-scale computing environments.
  • Expertise in BMS, power management, and commissioning/provisioning processes.
  • Hands-on experience with configuration management, Packer, QCOW2 images, and Datacenter Inventory Management Systems (NetBox, Nautobot, or similar).
  • Strong track record of cross-team collaboration to deliver operational excellence and reliability improvements.
  • Knowledge of automated break-fix solutions, message bus systems, workflow engines, and Zero Touch Provisioning is highly desirable.
  • Excellent problem-solving skills, attention to detail, and the ability to implement robust processes for uptime and performance optimization.

Responsibilities

  • Architect, commission, and provision GPU systems at large scale, ensuring supported firmware and component versions are maintained across operations.
  • Lead Day-2 operations, monitoring cluster hardware, identifying bottlenecks, and optimizing efficiency, performance, and availability.
  • Triage hardware break-fix issues, develop automated solutions, and continuously improve operational workflows.
  • Collaborate with hardware, software, and technical teams to define repeatable procedures and operational strategies aligned with SLAs.
  • Develop and enforce quality control procedures to minimize downtime and maintain high reliability for mission-critical AI infrastructure.
  • Provide documentation and operational guidance to support global AI data center deployments and internal teams.
  • Feed hardware and software requirements into engineering pipelines and coordinate with remote hands and field teams.

Benefits

  • Competitive base salary: $176,000–$276,000 (Level 4) or $208,000–$333,500 (Level 5), based on experience and location.
  • Equity participation and bonus eligibility.
  • Comprehensive medical, dental, and vision coverage.
  • Paid leave, holidays, and flexible work arrangements.
  • Professional development opportunities and access to learning platforms.
  • Retirement plans and financial wellness programs.
  • Collaborative environment with exposure to cutting-edge AI and open-source data center technologies.