About The Position

The AI2NE Org strives to be a global leader in the RDMA cluster networking domain and to enable seamless, accelerated advancements in High-Performance Computing (HPC), Artificial Intelligence, and Machine Learning. We envision a future where artificial intelligence and machine learning revolutionize industries, reshape societies, and unlock limitless possibilities. Our vision is to be a pioneering force, driving the development and design of state-of-the-art RDMA clusters tailored specifically for AI, ML, and HPC workloads. We strive to be the go-to experts in RDMA cluster architecture, leveraging our deep understanding of the unique demands of AI/ML and HPC applications. By staying at the forefront of technological advancements, we aim to redefine the boundaries of what is possible, pushing the envelope of computational capabilities and unlocking unprecedented performance.

We're looking for a hands-on leader with strong management experience to help us build new features and grow our team. You will lead a team of network development engineers in a fast-paced environment that requires agility and the drive to deliver. The team is responsible for provisioning, securing, scaling, and operating the network stack required to run distributed AI workloads across a cluster spanning thousands of GPUs. The candidate should be comfortable building complex distributed systems involving the management and control of hundreds of thousands of network devices.

Requirements

  • BS or MS in Computer Science, Network/Electrical Engineering, or equivalent experience.
  • Minimum of 7 years of experience in large-scale physical network support, including 2-4 years of people management experience.
  • Minimum of 2 years of experience in Network Development/Deployment at scale.
  • Experience building scalable, cloud-native distributed systems.
  • Ability to work in a collaborative, cross-functional team environment.
  • Solid understanding of key networking technologies needed for the cloud, including network design and fabrics, networking protocols, network automation, network telemetry, and common hardware platforms.
  • Ability to effectively communicate technical ideas verbally and in writing.
  • Understanding of the end-to-end configuration, technical dependencies, and overall behavioral characteristics of production services.
  • Experience with production operations and best practices for deploying code in production and troubleshooting issues when they arise.

Responsibilities

  • Leading a team of network development engineers
  • Provisioning, securing, scaling, and operating the network stack required to run distributed AI workloads across a cluster spanning thousands of GPUs
  • Building complex distributed systems involving the management and control of hundreds of thousands of network devices

Benefits

  • Oracle careers open the door to global opportunities where work-life balance flourishes.
  • We offer competitive benefits based on parity and consistency and support our people with flexible medical, life insurance, and retirement options.
  • We also encourage employees to give back to their communities through our volunteer programs.

What This Job Offers

  • Job Type: Full-time
  • Career Level: Manager
  • Industry: Publishing Industries
  • Number of Employees: 5,001-10,000 employees
