HPC / AI Software Infrastructure Lead (E)

KLA
Ann Arbor, MI
Onsite

About The Position

HPC/AI Software Infrastructure Leads are core to KLA’s technology. At KLA, we’re pushing the boundaries of semiconductor inspection through advanced AI and high-performance computing. We are looking for a hands-on technical leader to architect and scale the next generation of AI/HPC infrastructure powering our most critical imaging and data platforms. This role is ideal for someone who thrives at the intersection of distributed systems, GPU computing, and real-world AI workloads, and who enjoys building and mentoring high-performing engineering teams while driving technical excellence.

Requirements

  • 10+ years in software engineering, including leading and scaling technical teams
  • Proven success building distributed systems in HPC, AI/ML, or cloud-native environments
  • Track record of delivering performance-critical infrastructure at scale
  • Experience mentoring and growing early- and mid-career engineers
  • Deep understanding of distributed systems, parallel computing, and Linux systems programming
  • Strong programming skills in C++, Python, or comparable languages
  • Experience with GPU computing (CUDA, ROCm) and modern AI frameworks (PyTorch, TensorFlow, etc.)
  • Familiarity with high-performance storage systems, networking, and data pipelines
  • Strong foundation in CI/CD, DevOps, and production system reliability
  • Doctorate (academic) with 5 years of related work experience, or Master's degree with 8 years of related work experience, or Bachelor's degree with 12 years of related work experience

Nice To Haves

  • Background in image processing, computer vision, or scientific computing
  • Experience supporting hybrid HPC + AI workloads in production environments
  • Passion for developing talent and building inclusive, high-performing teams
  • Ability to operate as both a hands-on engineer and strategic technical leader
  • Strong communication skills with the ability to influence across engineering and product stakeholders

Responsibilities

  • Lead the architecture and development of large-scale HPC and AI infrastructure supporting cutting-edge image processing and machine learning workloads
  • Design scalable, high-performance distributed systems that unify traditional image processing with modern AI/Deep Learning pipelines
  • Drive GPU-accelerated computing strategies, optimizing performance across compute, storage, and networking layers
  • Partner cross-functionally with hardware, algorithms, and product teams to deliver robust, production-ready platforms
  • Establish engineering best practices (code quality, CI/CD, observability, performance tuning) for mission-critical systems
  • Mentor and develop engineers, providing technical guidance, coaching, and growth opportunities for junior team members
  • Serve as a technical leader and decision-maker, influencing architecture and long-term platform strategy

Benefits

  • medical
  • dental
  • vision
  • life
  • 401(k) including company matching
  • employee stock purchase program (ESPP)
  • student debt assistance
  • tuition reimbursement program
  • development and career growth opportunities and programs
  • financial planning benefits
  • wellness benefits including an employee assistance program (EAP)
  • paid time off
  • paid company holidays
  • family care and bonding leave