This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Site Reliability Engineer, AI Factory in the United States. This role focuses on designing, operating, and optimizing next-generation GPU-accelerated data centers at scale, ensuring performance, reliability, and efficiency for AI workloads. You will lead the end-to-end lifecycle of critical infrastructure, from provisioning and commissioning to day-to-day operations, while collaborating across hardware, software, and operations teams.

Success in this position requires deep technical expertise, hands-on problem-solving, and a passion for open-source solutions and automation. You will help define operational standards for large-scale AI facilities, drive continuous improvement, and implement processes that maintain uptime while still enabling cutting-edge innovation. The role offers the opportunity to shape global AI infrastructure in a high-performance, collaborative environment, working with engineers on distinctive telemetry, orchestration, and reliability challenges.
Job Type: Full-time
Career Level: Senior