Advisor - GPU Platform Engineering

Eli Lilly and Company, Indianapolis, IN
$135,000 - $213,400

About The Position

At Lilly, we unite caring with discovery to make life better for people around the world. We are a global healthcare leader headquartered in Indianapolis, Indiana. Our employees around the world work to discover and bring life-changing medicines to those who need them, improve the understanding and management of disease, and give back to our communities through philanthropy and volunteerism. We give our best effort to our work, and we put people first. We’re looking for people who are determined to make life better for people around the world.

Come help us unlock the power of AI- and HPC-based GPU and accelerated compute infrastructure! The Cloud and Connectivity organization is seeking experts and leaders in AI, High-Performance Computing (HPC), and Nvidia DGX server management. This role focuses on DGX server management, Spectrum X networking technologies, and Weka storage integration to support cutting-edge AI/ML workloads.

Requirements

  • Expertise in Linux system administration, HPC environments, and Nvidia DGX server management.
  • Experience with Spectrum X networking and parallel file systems.
  • Strong scripting skills and familiarity with containerization and automation tools.
  • 6+ years of demonstrated experience in AI/ML and HPC workloads and infrastructure.
  • Hands-on experience in using or operating High Performance Computing (HPC) grade infrastructure.
  • In-depth knowledge of accelerated computing (e.g., GPU), storage (e.g., Weka), scheduling & orchestration (e.g., Slurm, Kubernetes, LSF), high-speed networking (e.g., Ultra Ethernet, RoCE), and container technologies (e.g., Docker).
  • Passion for continual learning and keeping abreast of new technologies and effective approaches in the AI/ML infrastructure field.
  • Expertise in running and optimizing large-scale distributed training workloads using PyTorch (DDP, FSDP), NeMo, or JAX.
  • Deep understanding of AI/ML workflows, encompassing data processing, model training, and inference pipelines.
  • Proficiency in at least one scripting language such as Bash, Python, or equivalent.

Nice To Haves

  • Bachelor’s degree in Computer Science, Information Technology, or a related technical field.
  • 10+ years’ experience as a Linux OS/Platform Engineer.
  • Demonstrated experience leading a global large-scale Infrastructure project.

Responsibilities

  • Driving the engineering and operations of advanced Linux platforms supporting AI and HPC workloads.
  • Managing Nvidia DGX systems using Mission Control, Base Command and Run:AI.
  • Optimizing Spectrum X networking and WEKA storage for AI/ML applications.
  • Boosting productivity for Advanced Intelligence and Data Science teams by implementing improvements across AI/HPC infrastructure tooling and driving operational excellence.
  • Leading the strategy, engineering, and development of advanced Linux computing capabilities for AI/ML.
  • Advising the senior Linux platform engineer directing the global Linux strategy for on-premises private cloud and public IaaS Linux services.

Benefits

  • Eligibility to participate in a company-sponsored 401(k) and pension.
  • Vacation benefits.
  • Eligibility for medical, dental, vision and prescription drug benefits.
  • Flexible benefits (e.g., healthcare and/or dependent day care flexible spending accounts).
  • Life insurance and death benefits.
  • Certain time off and leave of absence benefits.
  • Well-being benefits (e.g., employee assistance program, fitness benefits, and employee clubs and activities).

What This Job Offers

  • Job Type: Full-time
  • Career Level: Senior
  • Education Level: Bachelor’s degree
  • Number of Employees: 5,001-10,000 employees
