Software Engineer - AI Inference for Science

Argonne National Laboratory•Lemont, IL

35d•$94,486 - $147,399

About The Position

The Argonne Leadership Computing Facility’s (ALCF) mission is to accelerate major scientific discoveries and engineering breakthroughs for humanity by designing and providing world-leading computing facilities in partnership with the computational science community. We help researchers solve some of the world’s largest and most complex problems with our unique combination of supercomputing resources and computational science expertise.

The ALCF has an opening for a Software Engineer to work on enabling AI for science, specifically targeting scalable inference on HPC systems and AI accelerators. The successful candidate will join the Data Services and Workflows group, which focuses on scientific workflows that combine large-scale data, simulations, analysis, and AI. In this position, the candidate can expect to explore and engineer solutions for AI inference integrated within scientific workflows, via programmatic access using standard programming interfaces (e.g., the OpenAI API), and through submission of large batches of prompts for parallel processing. Both scenarios require efficient execution on underlying resources, including ALCF’s HPC systems and AI testbed machines. This position demands a good understanding of currently available AI models (LLMs and otherwise), their compute and memory requirements, and how to utilize the underlying hardware for high responsiveness. As the space of AI models evolves, we will adapt and deploy new models and functionality.

The Data Services and Workflows group, and this position, work in a highly collaborative environment with science application teams, academia, and industry, as well as other national labs and agencies, to solve some of the world’s largest and most complex problems in science and engineering. The candidate will engage with science application teams and contribute to broader scientific initiatives.

Requirements

Experience with at least one AI framework is required, such as PyTorch or TensorFlow.
Comprehensive programming experience in one or more languages such as Python or C/C++.
Ability to create, maintain, and support high-quality software is essential.
Ability to work with and contribute to domain-specific software and models.
Experience with version control software such as git.
Ability to work collaboratively in a fast-paced environment.
Effective written and oral communication skills.
Ability to model Argonne’s core values of impact, safety, respect, integrity and teamwork.
RD2: Bachelor’s degree and 5+ years of experience, Master’s degree and 3+ years of experience, or PhD, or equivalent.
RD3: Bachelor’s degree and 8+ years of experience, Master’s degree and 5+ years of experience, or PhD and 4+ years of experience, or equivalent.

Nice To Haves

Experience designing or operating distributed inference or data services, including request routing, asynchronous execution, queueing, fault tolerance, and performance monitoring.
Experience integrating services with HPC schedulers (e.g., Slurm, PBS), including resource provisioning, job lifecycle management, and balancing latency-sensitive and throughput-oriented workloads.
Experience optimizing AI inference performance (e.g., batching, memory management, model parallelism, quantization, accelerator utilization) on GPU- or accelerator-based systems.
Familiarity with secure, multi-user services, including authentication/authorization, API security, and operating within institutional or regulated environments.
Experience running simulations or AI workflows on supercomputers.