HPC / AI Software Infrastructure Lead (E)

KLA•Ann Arbor, MI

Onsite

About The Position

HPC/AI Software Infrastructure Leads are core to KLA’s technology. At KLA, we’re pushing the boundaries of semiconductor inspection through advanced AI and high-performance computing. We are looking for a hands-on technical leader to architect and scale the next generation of HPC/AI infrastructure powering our most critical imaging and data platforms. This role is ideal for someone who thrives at the intersection of distributed systems, GPU computing, and real-world AI workloads, and who enjoys building and mentoring high-performing engineering teams while driving technical excellence.

Requirements

10+ years in software engineering, including leading and scaling technical teams
Proven success building distributed systems in HPC, AI/ML, or cloud-native environments
Track record of delivering performance-critical infrastructure at scale
Experience mentoring and growing early- and mid-career engineers
Deep understanding of distributed systems, parallel computing, and Linux systems programming
Strong programming skills in C++, Python, or similar systems-level languages
Experience with GPU computing (CUDA, ROCm) and modern AI frameworks (PyTorch, TensorFlow, etc.)
Familiarity with high-performance storage systems, networking, and data pipelines
Strong foundation in CI/CD, DevOps, and production system reliability
Doctorate (academic) with 5 years of related work experience; or Master's degree with 8 years of related work experience; or Bachelor's degree with 12 years of related work experience

Nice To Haves

Background in image processing, computer vision, or scientific computing
Experience supporting hybrid HPC + AI workloads in production environments
Passion for developing talent and building inclusive, high-performing teams
Ability to operate as both a hands-on engineer and strategic technical leader
Strong communication skills with the ability to influence across engineering and product stakeholders

Responsibilities

Lead the architecture and development of large-scale HPC and AI infrastructure supporting cutting-edge image processing and machine learning workloads
Design scalable, high-performance distributed systems that unify traditional image processing with modern AI/Deep Learning pipelines
Drive GPU-accelerated computing strategies, optimizing performance across compute, storage, and networking layers
Partner cross-functionally with hardware, algorithms, and product teams to deliver robust, production-ready platforms
Establish engineering best practices (code quality, CI/CD, observability, performance tuning) for mission-critical systems
Mentor and develop engineers, providing technical guidance, coaching, and growth opportunities for junior team members
Serve as a technical leader and decision-maker, influencing architecture and long-term platform strategy