Senior Site Reliability Engineer, AI Factory

Jobgether

2d•$176,000 - $333,500

About The Position

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Site Reliability Engineer, AI Factory in the United States. This role focuses on designing, operating, and optimizing next-generation GPU-accelerated data centers at scale, ensuring performance, reliability, and efficiency for AI workloads. You will lead the end-to-end lifecycle of critical infrastructure, from provisioning and commissioning to day-to-day operations, while collaborating across hardware, software, and operational teams. Success in this position requires deep technical expertise, hands-on problem solving, and a passion for open-source solutions and automation. You will help define operational standards for large-scale AI facilities, drive continuous improvement, and implement processes that maintain uptime while enabling cutting-edge innovation. This role offers the opportunity to impact global AI infrastructure and work in a high-performance, collaborative environment with engineers tackling unique telemetry, orchestration, and reliability challenges.

Requirements

Bachelor's or Master's degree in Computer Engineering, Computer Science, or a related field, or equivalent experience.
10+ years of experience in data center operations, site reliability, or critical infrastructure management.
Proven experience managing GPU fleets and large-scale computing environments.
Expertise in building management systems (BMS), power management, and commissioning/provisioning processes.
Hands-on experience with configuration management, Packer, QCOW2 images, and data center inventory management systems (NetBox, Nautobot, or similar).
Strong track record of cross-team collaboration to deliver operational excellence and reliability improvements.
Knowledge of automated break-fix solutions, message bus systems, workflow engines, and Zero Touch Provisioning is highly desirable.
Excellent problem-solving skills, attention to detail, and the ability to implement robust processes for uptime and performance optimization.

Responsibilities

Architect, commission, and provision GPU systems at large scale, ensuring supported firmware and component versions are maintained across operations.
Lead Day-2 operations, monitoring cluster hardware, identifying bottlenecks, and optimizing efficiency, performance, and availability.
Triage hardware break-fix issues, develop automated solutions, and continuously improve operational workflows.
Collaborate with hardware, software, and technical teams to define repeatable procedures and operational strategies aligned with SLAs.
Develop and enforce quality control procedures to minimize downtime and maintain high reliability for mission-critical AI infrastructure.
Provide documentation and operational guidance to support global AI data center deployments and internal teams.
Feed hardware and software requirements into engineering pipelines and coordinate with remote hands and field teams.

Benefits

Competitive base salary: $176,000–$276,000 (Level 4) or $208,000–$333,500 (Level 5), based on experience and location.
Equity participation and bonus eligibility.
Comprehensive medical, dental, and vision coverage.
Paid leave, holidays, and flexible work arrangements.
Professional development opportunities and access to learning platforms.
Retirement plans and financial wellness programs.
Collaborative environment with exposure to cutting-edge AI and open-source data center technologies.
