About The Position

The High Performance Computing and Artificial Intelligence (HPC and AI) team is building the next-generation distributed AI supercomputer. Our goal is to enable breakthroughs in AI by delivering unmatched computational power, scalability, and reliability. We design and develop advanced infrastructure that supports high-performance model training at scale, laying the groundwork for innovations that expand the boundaries of what AI can achieve.

We are seeking a Cloud Network Engineer II who is passionate about designing and developing the infrastructure that powers large-scale AI and HPC systems. In this role, you will contribute to the design, deployment, and operation of network infrastructure, automation workflows, observability frameworks, and performance optimization systems. These components are essential for achieving ultra-low latency, high throughput, and efficient data movement at petabyte scale across distributed workloads.

As a Cloud Network Engineer II on the HPC and AI Infrastructure team, you will work at the intersection of AI supercomputing and large-scale networking. Your contributions will directly impact the reliability and performance of distributed clusters that leverage high-speed fabrics such as Ethernet and InfiniBand, along with accelerated compute platforms including NVIDIA and AMD GPUs. This is a unique opportunity to help build the network infrastructure that delivers speed, reliability, and high availability at exascale, while collaborating across hardware, infrastructure, and platform teams.

Requirements

  • Experience in designing and developing network infrastructure.
  • Knowledge of automation workflows and observability frameworks.
  • Familiarity with performance optimization systems.
  • Experience with high-speed fabrics such as Ethernet and InfiniBand.
  • Knowledge of accelerated compute platforms including NVIDIA and AMD GPUs.

Responsibilities

  • Design, deploy, and operate network infrastructure for large-scale AI and HPC systems.
  • Develop automation workflows and observability frameworks.
  • Optimize system performance for ultra-low latency and high throughput.
  • Ensure efficient data movement at petabyte scale in distributed workloads.
  • Collaborate with hardware, infrastructure, and platform teams.