AI Ops Engineer

KLA
Ann Arbor, MI

About The Position

KLA is a global leader in diversified electronics for the semiconductor manufacturing ecosystem. Virtually every electronic device in the world is produced using our technologies. No laptop, smartphone, wearable device, voice-controlled gadget, flexible screen, VR device or smart car would have made it into your hands without us. KLA invents systems and solutions for the manufacturing of wafers and reticles, integrated circuits, packaging, printed circuit boards and flat panel displays. The innovative ideas and devices that are advancing humanity all begin with inspiration, research and development. Innovation is central to how we work, and we invest 15% of sales back into R&D. Our expert teams of physicists, engineers, data scientists and problem-solvers work together with the world’s leading technology providers to accelerate the delivery of tomorrow’s electronic devices. Life here is exciting and our teams thrive on tackling hard problems. There is never a dull moment with us.

Group/Division

The KLA Services team, headquartered in Milpitas, CA, is our service organization and consists of Service Sales and Marketing, Spares Supply Chain Management, Field Operations, Engineering, Product Training, and Technical Support. The KLA Services organization partners with our field teams and customers in all business sectors to maintain the high performance and productivity of our products through a flexible portfolio of services. Our comprehensive services include: proactive management of tools to identify and improve performance; expertise in optics, image processing and motion control, backed by worldwide service engineers, 24/7 technical support teams and knowledge management systems; and an extensive parts network to ensure worldwide availability of parts.

We are seeking a highly skilled and passionate Senior AI Ops Engineer to join our team. This role will be pivotal in architecting and delivering the automation layer that enables fast, reproducible, and scalable model development, spanning end-to-end experiment management, model fine-tuning pipelines, and Reinforcement Learning from Human Feedback (RLHF). If you’re a systems-minded engineer who loves turning research workflows into reliable, production-grade pipelines, setting standards, and mentoring others to raise the bar across the organization, we encourage you to apply.

Requirements

  • Strong proficiency in Python and experience building robust automation frameworks and production-grade services for ML workloads
  • Hands-on experience with experiment tracking and model lifecycle tooling (e.g., MLflow, Weights & Biases) and reproducible ML workflows
  • Practical experience fine-tuning modern deep learning models (e.g., Transformers) and familiarity with parameter-efficient approaches (LoRA/PEFT)
  • Working knowledge of RLHF concepts and pipelines (preference data, reward models, policy optimization) and how to operationalize human-in-the-loop workflows
  • Experience with containerization (Docker), orchestration (Kubernetes), and operating GPU workloads reliably at scale
  • Experience with CI/CD, version control (Git), and Infrastructure-as-Code (Terraform/Bicep or equivalent)
  • Excellent problem-solving skills across distributed systems (training jobs, pipelines, compute infrastructure) and strong communication to partner with research and engineering teams.
  • Bachelor’s degree in Computer Science, Software Engineering, or related field
  • 5+ years of experience in MLOps/Platform Engineering/DevOps/ML Engineering (or demonstrated equivalent impact), including owning production systems and leading cross-team initiatives

Nice To Haves

  • Prior experience in a similar industry and/or operating ML platforms with stringent IP/security requirements

Responsibilities

  • Implement and operate experiment tracking, lineage, and reproducibility standards (datasets, code, configs, artifacts, metrics) using MLflow/W&B or equivalents
  • Build CI/CD for ML: tests (unit/integration), packaging, reproducibility checks, policy gates, automated deployment and rollback strategies
  • Design workflow orchestration for large-scale ML jobs (scheduled runs, triggered retrains, parameter sweeps, gated releases) using tools such as Airflow/Kubeflow/Argo or equivalents
  • Architect, build, and own automated pipelines for model training, fine-tuning (e.g., PEFT/LoRA), evaluation, and promotion across environments (dev → staging → production)
  • Establish standardized training “recipes” (configs, templates, golden paths) to reduce time-to-first-experiment and improve consistency across teams
  • Enable and optimize distributed GPU training (throughput, reliability, and cost), including checkpointing, mixed precision, fault tolerance, and spot/preemptible handling where applicable
  • Develop evaluation harnesses and automated benchmark suites (quality, safety, latency, and cost) with clear, repeatable reporting to compare runs and releases

Benefits

  • medical, dental, vision, life, and other voluntary benefits
  • 401(k), including company matching
  • employee stock purchase program (ESPP)
  • student debt assistance
  • tuition reimbursement program
  • development and career growth opportunities and programs
  • financial planning benefits
  • wellness benefits including an employee assistance program (EAP)
  • paid time off and paid company holidays
  • family care and bonding leave