Network Development Engineer (Ops & Deploy) - xAI Networking

xAI • Palo Alto, CA

About The Position

xAI's mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company's mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills and to share knowledge with their teammates concisely and accurately.

xAI is building at a furious pace with the latest hardware to help people understand the universe, and we need Network Development Engineers (NDEs) with at least 3 years of experience deploying or operating large-scale production data center or backbone networks. You will own the availability and/or deployment of production networks for X and xAI, including data center and backbone networks and the primary frontend and backend networks that train Grok and that our customers use for inference.

Deployment Engineers will own all aspects of planning and building greenfield and brownfield network deployments. Operations Engineers will own timely mitigation of network impairments at all layers of our network and the return to service of network hardware and capacity. You will be expected to participate in a team on-call rotation and to contribute to scaling and maintenance efforts.

Requirements

A minimum of 3 years of experience deploying or operating hyperscale networks.
Hands-on experience with networking protocols and tools (e.g., BGP, OSPF, ZTP).
Experience with Python scripting and with automating tasks, acquiring metrics, and analyzing large data sets.
Strong problem-solving skills and ability to thrive in a fast-paced, ambiguous setting.
Bachelor's degree in Computer Science, Electrical Engineering, or a related field (or equivalent experience).

Nice To Haves

Experience designing hyperscale network infrastructure or large-scale GPU clusters and automating their entire deployment process.
Proven track record in leading on-call rotations, incident response, and team development in high-stakes environments.
A working understanding of RoCEv2.

Responsibilities

Deploying or operating scalable network architectures for AI/HPC workloads, inter-DC, and backbone network fabrics.
Acting as a power user of, and iterating on, software and tooling for network operations, network deployment, and monitoring.
Collaborating with cross-functional teams on data center & backbone buildouts and optimizations.
Analyzing performance and availability metrics to identify and resolve bottlenecks, availability impairments, or inefficient build processes.
Ensuring high availability, fast deployability, and strong security of production networks.