Senior Network Engineer

TensorWave · Las Vegas, NV

About The Position

We’re seeking a Senior Network Engineer focused on implementing and operating large-scale, Arista-based RoCEv2 data center networks powering next-generation AI and ML infrastructure. You’ll work hand-in-hand with our network architect to design the infrastructure that keeps over 8,000 GPUs running, and play a critical role in the implementation and maintenance of our next-generation systems, with cluster sizes reaching over 100,000 GPUs. You’ll work hands-on with high-speed optics, switching, and routing in production clusters and implement the modern automation and tooling critical to how the network is deployed, validated, and operated.

Requirements

  • Bachelor’s degree in Computer Science, Electrical Engineering, or a related technical field, or equivalent practical experience
  • Deep experience with RDMA and RoCEv2 in large-scale production data centers supporting AI or HPC workloads
  • Strong Arista expertise, including EOS, hardware platforms, and operating high-speed Ethernet fabrics
  • Proven knowledge of congestion management and performance tuning using PFC, ECN, and DCQCN
  • Hands-on experience with high-speed optics and cabling, including 400G and 800G links, AEC, AOC, and DAC cables, and structured cabling in dense environments
  • Automation and operations mindset, with experience using Python, Ansible, Terraform, Git, and observability tooling in always-on production systems

Responsibilities

  • As a senior engineer, you are responsible for designing systems, driving technical direction, and mentoring other engineers, with demonstrable examples of architectures you’ve owned and teams you’ve influenced
  • Design, deploy, and operate large-scale Arista-based RoCEv2 data center networks supporting AI and ML clusters from thousands to 100,000+ GPUs
  • Own congestion management and performance tuning across RDMA fabrics, including PFC, ECN, and DCQCN, in production environments
  • Implement and maintain automation, validation, and observability tooling using Python, Ansible, Terraform, and modern DevOps workflows
  • Ensure high availability and reliability across multi-tenant environments by leading operational excellence, incident response, and continuous improvement

Benefits

  • Stock Options
  • 100% paid Medical, Dental, and Vision insurance
  • Life and Voluntary Supplemental Insurance
  • Short Term Disability Insurance
  • Flexible Spending Account
  • 401(k)
  • Flexible PTO
  • Paid Holidays
  • Parental Leave
  • Mental Health Benefits through Spring Health