Remote | CUDA & GPU Kernel Optimization Engineer — $70–$90/hour

24-Mag•New York, NY

About The Position

We are sharing a specialised part-time consulting opportunity for CUDA and GPU programming professionals experienced in kernel optimization, C++ engineering, profiler-guided performance analysis, GPU hardware utilization, and technical review. The role supports current and upcoming remote engagements covering GPU kernel optimization, performance evaluation, CUDA/HIP code review, profiler metric analysis, and C++ and Python workflows. Selected professionals will analyze kernels, identify performance bottlenecks, improve implementation quality, and document optimization decisions across modern hardware environments.

Requirements

Strong practical experience with GPU programming and kernel optimization
Fluency in core C++ features through C++17
Working knowledge of Python and Git
Fluency in at least one GPU programming model or shading language, such as CUDA, HIP, Slang, HLSL, or GLSL
At least 1 year of professional or graduate-level research experience working with GPUs
Strong understanding of GPU profiler performance metrics and how to use them to optimize kernels
Ability to work independently on technical review and optimization tasks
Availability to work at least 20 hours per week, with the exact commitment depending on project scope

Nice To Haves

Experience with CUDA, HIP, CUDA C++ Core Libraries, inline PTX assembly, or tensor core-level optimization
Experience optimizing kernels for NVIDIA Blackwell hardware or other modern GPU architectures
Familiarity with Nsight Compute or comparable GPU profiling tools
Prior experience at GPU hardware companies such as NVIDIA, AMD, or Qualcomm, or in similar technical environments
Open-source contributions related to GPU kernel optimization, HPC, compiler tooling, graphics, or performance engineering

Responsibilities

Analyze and optimize GPU kernels for performance, efficiency, and hardware utilization
Review kernel implementations and identify bottlenecks in memory access, occupancy, throughput, or execution patterns
Improve performance outcomes using CUDA, HIP, shader programming, or related GPU programming models
Optimize kernels even when limited background context is available for the underlying algorithm
Use profiler metrics such as L2 cache hit rate, L2 throughput, and occupancy, along with memory behavior and related performance signals, to guide optimization
Evaluate when specific profiler metrics are useful, misleading, or secondary to other optimization factors
Document optimization decisions clearly and explain tradeoffs in technical terms
Calibrate performance judgments against structured benchmarks, hardware constraints, and project-specific criteria
Write, modify, and reason about C++17, Python, and GPU kernel code
Review code for correctness, performance impact, maintainability, and optimization potential
Use Git-based workflows to manage technical materials and project submissions
Apply practical GPU programming expertise across CUDA, HIP, Slang, HLSL, GLSL, or related kernel programming environments