The Argonne Leadership Computing Facility’s (ALCF) mission is to accelerate major scientific discoveries and engineering breakthroughs for humanity by designing and providing world-leading computing facilities in partnership with the computational science community. We help researchers solve some of the world’s largest and most complex problems with our unique combination of supercomputing resources and computational science expertise.
The ALCF Performance Engineering Group invites applications for a postdoctoral appointee to develop and scale profiling capabilities for large, heterogeneous HPC workflows that combine AI and traditional modeling and simulation (ModSim). You will work with cutting-edge exascale systems and novel AI hardware, collaborating closely with science application teams, academia, industry partners, and other national laboratories.
Objective:
Enhance THAPI: Extend and optimize the THAPI profiler (https://github.com/argonne-lcf/THAPI) to concurrently profile AI/ML and ModSim components at scale.
API & Tracing Integration: Design and implement new tracing API layers to capture fine-grained performance data across diverse runtime environments. The main targets are additional communication layers (NCCL, libfabric) and Python-based applications, traced either via Python internals or via the native support provided by Python libraries such as PyTorch.
Job Type
Full-time
Education Level
Ph.D. or professional degree
Number of Employees
1,001-5,000 employees