System Software Engineer

xAI, Palo Alto, CA

About The Position

As a Data Center System Software Engineer at xAI, you will play a pivotal role in ensuring the reliability, scalability, and performance of our state-of-the-art data center infrastructure, including the Colossus supercluster in Memphis—the world's largest AI training cluster with over 100,000 liquid-cooled Nvidia GPUs and plans for expansion to 1 million. This infrastructure powers advanced AI workloads, massive-scale model training, and products like Grok, enabling breakthroughs in understanding the universe. You will collaborate with cross-functional teams to automate operations, enhance observability, and maintain high availability for large-scale distributed systems. This is a hands-on technical position in a dynamic environment, offering the opportunity to tackle complex challenges at the intersection of AI, data center operations, and software reliability.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent experience).
  • 5+ years of experience in site reliability engineering, data center operations, or large-scale infrastructure management.
  • Expert-level knowledge of Kubernetes (on-prem and cloud), infrastructure-as-code tools (Pulumi, Terraform), and CI/CD systems (Buildkite, ArgoCD).
  • Proficiency in at least one systems programming language (Rust, C++, Go) and strong scripting/automation skills.
  • Deep understanding of monitoring and observability technologies.
  • Strong troubleshooting skills across hardware, networking, and distributed software systems.
  • Proven experience with incident response, including on-call rotations, rapid incident resolution, root cause analysis, and implementation of preventative measures.
  • Excellent communication and documentation skills, with the ability to share knowledge concisely and accurately.

Nice To Haves

  • Experience supporting AI/ML workloads or high-density compute environments, including large-scale GPU clusters and HPC systems.
  • Familiarity with data center electrical, cooling, and network systems, such as liquid-cooling and high-bandwidth interconnects.
  • Certifications in Kubernetes or data center operations.
  • Experience with both on-premises and cloud infrastructure at scale.

Responsibilities

  • Maintain and improve the reliability and uptime of xAI’s on-premises and cloud-based data center environments, including high-density GPU clusters for AI training.
  • Design, implement, and manage monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, PagerDuty).
  • Develop and maintain infrastructure-as-code (Pulumi, Terraform) and continuous deployment pipelines (Buildkite, ArgoCD).
  • Participate in on-call rotations, respond to incidents, perform root cause analysis, and drive post-mortem processes.
  • Analyze system performance, forecast capacity needs, and optimize resource utilization for massive AI/ML workloads.
  • Collaborate with hardware, networking, and software engineering teams to design and implement resilient, scalable solutions, such as RDMA fabrics and liquid-cooling systems.
  • Create and maintain documentation and standard operating procedures.
  • Contribute to the efficiency of AI training pipelines by identifying and mitigating bottlenecks in compute, storage, and networking at unprecedented scales.