Senior Software Engineer – AI Middleware

Cornelis Networks, Inc.
Austin, TX
Remote

About The Position

Cornelis Networks delivers the world’s highest-performance scale-out networking solutions for AI and HPC datacenters. Our differentiated architecture seamlessly integrates hardware, software, and system-level technologies to maximize the efficiency of GPU, CPU, and accelerator-based compute clusters at any scale. Our solutions drive breakthroughs in AI and HPC workloads, empowering our customers to push the boundaries of innovation. Backed by top-tier venture capital and strategic investors, we are committed to innovation, performance, and scalability, solving the world’s most demanding computational challenges with our next-generation networking solutions.

We are a fast-growing, forward-thinking team of architects, engineers, and business professionals with a proven track record of building successful products and companies. As a global organization, our team spans multiple U.S. states and six countries, and we continue to expand with exceptional talent in onsite, hybrid, and fully remote roles.

We are seeking a highly experienced Senior Software Engineer to design, develop, and enable upstream support for Cornelis Networks’ AI communication middleware. This role focuses on distributed AI workloads and on enabling and optimizing collective communication libraries (e.g., NCCL/RCCL) over Cornelis Networks’ interconnects.

Requirements

  • 8+ years of experience in high-performance systems programming in C/C++ on Linux.
  • Strong experience with GPU communication stacks including CUDA/ROCm and NCCL/RCCL.
  • Ability to optimize distributed training performance using profiling and tracing.
  • Understanding of collective communication concepts and topology awareness.
  • Experience delivering production-quality code.
  • Open-source contributions in relevant areas.

Nice To Haves

  • Experience with AI frameworks such as PyTorch Distributed, DeepSpeed, and Megatron-LM.
  • Familiarity with libfabric/OFI, UCX, and RDMA concepts.
  • Experience with RoCEv2 and Ultra Ethernet.
  • Experience building cluster-scale performance test infrastructure.

Responsibilities

  • Design and implement performance-critical features for CCL enablement on Cornelis Networks’ fabrics.
  • Optimize distributed training performance across multi-node, multi-GPU configurations.
  • Improve GPU communication paths including GPU-direct transfers, IPC, and CPU/GPU synchronization.
  • Profile distributed AI workloads and identify bottlenecks across the software and hardware stack.
  • Tune AI frameworks such as PyTorch Distributed, TensorFlow/XLA, JAX, DeepSpeed, and Megatron-LM.
  • Develop benchmarks and microbenchmarks aligned with real model performance.
  • Contribute upstream to AI communication and distributed training projects.
  • Participate in design reviews, code reviews, CI, and long-term maintenance.
  • Prototype and validate Ultra Ethernet capabilities for AI collective communication.
  • Provide technical input for deployment considerations and performance validation.
  • Collaborate with kernel/driver, switch, performance, and systems teams.
  • Support advanced escalations by analyzing traces and providing robust fixes.

Benefits

  • We offer a competitive compensation package that includes equity, cash, and incentives, along with health and retirement benefits.
  • In addition to your base pay, you’ll have access to a broad range of benefits, including medical, dental, and vision coverage, as well as disability and life insurance, a dependent care flexible spending account, accidental injury insurance, and pet insurance.
  • We also offer generous paid holidays, 401(k) with company match, and Open Time Off (OTO) for regular full-time exempt employees.
  • Other paid time off benefits include sick time, bonding leave, and pregnancy disability leave.