In this role, you will be responsible for driving down wall-clock time to convergence by profiling and eliminating bottlenecks across the foundation model training stack, from data pipelines to GPU kernels. You will design, build, and optimize distributed training systems in PyTorch for multi-node GPU clusters, ensuring scalability, robustness, and high utilization. You will also write efficient low-level code, including custom CUDA and Triton kernels built on libraries such as cuDNN, and integrate it seamlessly into high-level training frameworks. In addition, you will optimize workloads for hardware efficiency, focusing on CPU/GPU compute balance, memory management, data throughput, and networking. Finally, you will develop monitoring and debugging tools for large-scale runs, enabling rapid diagnosis of performance regressions and failures. A minimal sketch of the kind of distributed setup involved appears below.
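To illustrate the kind of multi-node PyTorch work described above, here is a minimal sketch of a DistributedDataParallel training loop launched with `torchrun`. The model, batch size, and hyperparameters are placeholders, not details from this posting.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main() -> None:
    # torchrun sets LOCAL_RANK, RANK, and WORLD_SIZE for each process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Stand-in model; a real run would build the foundation model here.
    model = torch.nn.Linear(4096, 4096).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):  # stand-in training loop with synthetic data
        x = torch.randn(32, 4096, device=local_rank)
        loss = model(x).square().mean()
        loss.backward()  # gradients are all-reduced across ranks here
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched, for example, with `torchrun --nnodes=2 --nproc_per_node=8 train.py`; profiling, kernel-level optimization, and monitoring would layer on top of a loop like this.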
Career Level: Senior