About The Position

Are you passionate about Generative AI? Are you interested in working on groundbreaking generative modeling technologies to enrich the lives of billions of people? We are the Intelligence System Experience (ISE) team within Apple's software organization. The team operates at the intersection of multimodal machine learning and system experiences. Our multidisciplinary ML teams focus on a broad spectrum of areas, including Visual Generative Foundation Models; Multimodal Understanding; Visual Understanding of People, Text, Handwriting, and Scenes; Personalization; Knowledge Extraction; Conversation Analysis; Behavioral Modeling for Proactive Suggestions; and Privacy-Preserving Learning. These innovations form the foundation of the seamless, intelligent experiences our users enjoy every day.

We are seeking an ML Infrastructure Engineer to design, optimize, and scale the systems that power large-scale model training across the organization. This role sits at the intersection of high-performance computing, machine learning, and infrastructure engineering, delivering the core capabilities teams rely on to iterate quickly and reliably.

The ideal candidate brings strong software engineering fundamentals, deep familiarity with distributed training, and a passion for building infrastructure that is efficient, observable, and easy for ML practitioners to use. You'll work closely with model developers and platform teams to ensure training workflows are fast, reliable, and cost-effective, while also supporting users operationally to keep them unblocked and productive.

Responsibilities

  • Build and maintain distributed training infrastructure.
  • Optimize training performance through profiling, parallelization strategies, and hardware-aware tuning.
  • Develop reliable pipelines for data loading, checkpointing, logging, and monitoring to support high-throughput training jobs.
  • Collaborate directly with ML engineers to understand scaling bottlenecks and design solutions that improve both training speed and resource efficiency.
  • Create and maintain tooling that simplifies how users configure, launch, and debug distributed training jobs.
  • Implement strong observability across training workflows: telemetry, dashboards, alerts, and diagnostics.
  • Support training workloads, investigate failures, triage performance regressions, and gather real feedback from users.
  • Communicate clearly and collaborate effectively with ML practitioners and infra teams.

What This Job Offers

  • Job Type: Full-time
  • Industry: Computer and Electronic Product Manufacturing
  • Education Level: No Education Listed
  • Number of Employees: 5,001-10,000 employees
