Senior Software Engineer – AI Middleware

Cornelis Networks, Inc. • Austin, TX

Remote

About The Position

Cornelis Networks delivers the world’s highest-performance scale-out networking solutions for AI and HPC datacenters. Our differentiated architecture seamlessly integrates hardware, software, and system-level technologies to maximize the efficiency of GPU, CPU, and accelerator-based compute clusters at any scale. Our solutions drive breakthroughs in AI and HPC workloads, empowering our customers to push the boundaries of innovation. Backed by top-tier venture capital and strategic investors, we are committed to innovation, performance, and scalability - solving the world’s most demanding computational challenges with our next-generation networking solutions.

We are a fast-growing, forward-thinking team of architects, engineers, and business professionals with a proven track record of building successful products and companies. As a global organization, our team spans multiple U.S. states and six countries, and we continue to expand with exceptional talent in onsite, hybrid, and fully remote roles.

We are seeking a highly experienced Senior Software Engineer to design, develop, and upstream Cornelis Networks’ AI communication middleware. This role focuses on distributed AI workloads and on enabling and optimizing collective communication libraries (e.g., NCCL/RCCL) over Cornelis Networks’ interconnects.

Requirements

8+ years of experience in high-performance systems programming in C/C++ on Linux.
Strong experience with GPU communication stacks including CUDA/ROCm and NCCL/RCCL.
Ability to optimize distributed training performance using profiling and tracing.
Understanding of collective communication concepts and topology awareness.
Experience delivering production-quality code.
Open-source contributions in relevant areas.

Nice To Haves

Experience with AI frameworks such as PyTorch Distributed, DeepSpeed, and Megatron-LM.
Familiarity with libfabric/OFI, UCX, and RDMA concepts.
Experience with RoCEv2 and Ultra Ethernet.
Experience building cluster-scale performance test infrastructure.

Responsibilities

Design and implement performance-critical features for collective communication library (CCL) enablement on Cornelis Networks’ fabrics.
Optimize distributed training performance across multi-node, multi-GPU configurations.
Improve GPU communication paths including GPU-direct transfers, IPC, and CPU/GPU synchronization.
Profile distributed AI workloads and identify bottlenecks across the software and hardware stack.
Tune AI frameworks such as PyTorch Distributed, TensorFlow/XLA, JAX, DeepSpeed, and Megatron-LM.
Develop benchmarks and microbenchmarks aligned with real model performance.
Contribute upstream to AI communication and distributed training projects.
Participate in design reviews, code reviews, CI, and long-term maintenance.
Prototype and validate Ultra Ethernet capabilities for AI collective communication.
Provide technical input for deployment considerations and performance validation.
Collaborate with kernel/driver, switch, performance, and systems teams.
Support advanced escalations by analyzing traces and providing robust fixes.

Benefits

We offer a competitive compensation package that includes equity, cash, and incentives, along with health and retirement benefits.
In addition to your base pay, you’ll have access to a broad range of benefits, including medical, dental, and vision coverage, as well as disability and life insurance, a dependent care flexible spending account, accidental injury insurance, and pet insurance.
We also offer generous paid holidays, 401(k) with company match, and Open Time Off (OTO) for regular full-time exempt employees.
Other paid time off benefits include sick time, bonding leave, and pregnancy disability leave.
