GPU Software Engineer

KLA, Milpitas, CA

About The Position

KLA is a global leader in diversified electronics for the semiconductor manufacturing ecosystem. Virtually every electronic device in the world is produced using our technologies. No laptop, smartphone, wearable device, voice-controlled gadget, flexible screen, VR device or smart car would have made it into your hands without us. KLA invents systems and solutions for the manufacturing of wafers and reticles, integrated circuits, packaging, printed circuit boards and flat panel displays.

The innovative ideas and devices that are advancing humanity all begin with inspiration, research and development. KLA places an above-average emphasis on innovation, and we invest 15% of sales back into R&D. Our expert teams of physicists, engineers, data scientists and problem-solvers work together with the world’s leading technology providers to accelerate the delivery of tomorrow’s electronic devices. Life here is exciting and our teams thrive on tackling really hard problems. There is never a dull moment with us.

Group/Division

Enabling the movement toward advanced chip design, KLA's Measurement, Analytics and Control group (MACH) is looking for the best and brightest research scientists, software engineers, application development engineers and senior product technology process engineers to join our team. The MACH team's mission is to collaborate with our customers to innovate technologies and solutions that detect and control highly complex process variations at their source, rather than compensate for them at later stages of the manufacturing process. With over 40 years of semiconductor process control experience, KLA is relied on by chipmakers around the globe to ensure that their fabs ramp next-generation devices to volume production quickly and cost-effectively.

Our MACH team develops leading-edge solutions for patterning process analytics and control technologies, providing customers with critical insight at the feature, field and cross-wafer levels. Our teams also develop advanced modeling simulation, data analytics and process control modeling technologies. As a member of the MACH team, you’ll be joining the most sophisticated and successful process-control company in the semiconductor industry, working across functions to solve the most complex technical problems in the digital age.

Requirements

  • BS/MS in CS, EE, or related field.
  • 3–6 years of professional experience, including 2+ years focused on CUDA-based image processing.
  • Strong C++17/20 fundamentals; solid understanding of parallel algorithms and data layouts (pitch-linear, planar, interleaved).
  • Practical experience with Nsight profiling, occupancy analysis, and kernel optimization (tiling, warp-level intrinsics, streams).
  • Experience with OpenCV (including CUDA paths), and at least one of: NPP, cuFFT, cuBLAS.
  • Comfortable on Linux; CMake, Git, code reviews, and automated testing.

Nice To Haves

  • DMA/zero-copy, pinned memory.
  • Texture memory and surface writes for image sampling; CUDA Graphs and stream concurrency.
  • Familiarity with image feature extraction and classification.
  • Exposure to ML inference acceleration (TensorRT/cuDNN) for CV tasks is a plus, but core focus is classical image processing.
  • Experience with performance modeling (roofline), and multi-GPU awareness (NCCL) is a bonus.
  • Experience with sharding, tensor parallelism, and related techniques.

Responsibilities

  • Implement and optimize CUDA kernels for image operations: convolution/filters, morphological ops, warping/resampling, color space conversions (RGB/YUV/HSV), denoising/deblurring, HDR/tone mapping, polygon manipulation, feature extraction, and classification.
  • Use GPU memory hierarchies effectively (global/shared/constant/texture), coalesce memory, apply shared memory tiling, and minimize divergence/branching.
  • Profile and tune with Nsight Compute/Systems, CUDA-MEMCHECK, and cuda-gdb; instrument pipelines with metrics (FPS, latency, bandwidth, occupancy).
  • Collaborate with product and algorithm teams; contribute to CI/CD (Azure DevOps, CMake, GitHub Actions/GitLab CI) and documentation.
  • Integrate accelerated primitives (NVIDIA NPP, cuFFT, cuBLAS) and OpenCV CUDA modules; build clean C++ APIs with Python bindings (pybind11) when needed.
  • Implement a distributed multi-process architecture using CUDA MPS for high-throughput, concurrent workloads.
  • Own performance-critical pipelines, profile on NVIDIA GPUs, and ship production-quality C++ that meets strict latency and throughput targets.

Benefits

  • medical
  • dental
  • vision
  • life, and other voluntary benefits
  • 401(k) including company matching
  • employee stock purchase program (ESPP)
  • student debt assistance
  • tuition reimbursement program
  • development and career growth opportunities and programs
  • financial planning benefits
  • wellness benefits including an employee assistance program (EAP)
  • paid time off and paid company holidays
  • family care and bonding leave