xAI is seeking a highly skilled Member of Technical Staff to join their team in managing and enhancing reliability across a multi-data center environment. This role focuses on automating processes, building and implementing robust observability solutions, and ensuring seamless operations for mission-critical AI infrastructure. The ideal candidate will combine strong coding abilities with hands-on data center experience to build scalable reliability services, optimize system performance, and minimize downtime—including close partnership with facility operations to address physical infrastructure impacts. This position plays a pivotal role in bridging software engineering principles with physical data center realities. By prioritizing automation and observability, team members in this role can reduce mean time to recovery (MTTR) by up to 50% through proactive monitoring and automated remediation. The primary objective of this team is to mitigate downtime and minimize impact to end-users from both scheduled and unscheduled maintenance, as well as events affecting onsite data centers. This is achieved through proactive automation, robust observability, and integrated software-physical reliability strategies, ensuring our AI infrastructure remains resilient, scalable, and at the cutting edge of innovation.
Stand Out From the Crowd
Upload your resume and get instant feedback on how well it matches this job.
Job Type
Full-time
Career Level
Mid Level
Education Level
Associate degree