About The Position

The Argonne Leadership Computing Facility’s (ALCF) mission is to accelerate major scientific discoveries and engineering breakthroughs for humanity by designing and providing world-leading computing facilities in partnership with the computational science community. We help researchers solve some of the world’s largest and most complex problems with our unique combination of supercomputing resources and computational science expertise.

The ALCF Performance Engineering Group invites applications for a postdoctoral appointee to develop and scale profiling capabilities for large, heterogeneous HPC workflows that combine AI and traditional modeling and simulation (ModSim). You will work with cutting-edge exascale systems and novel AI hardware, collaborating closely with science application teams, academia, industry partners, and other national laboratories.

Requirements

  • Ph.D. (completed within the last 0-5 years) or equivalent experience in a computational science discipline, computer science, or a related field.
  • Hands-on experience with performance profiling and tracing tools (LTTng, Babeltrace, perf, ftrace, etc.).
  • Strong C (and/or C++) system-programming skills and familiarity with dynamic linking (e.g., ldd).
  • Experience developing and optimizing scientific workflows, ideally combining AI and traditional simulations.
  • Experience with scientific computing and software development on HPC systems.
  • Ability to conduct independent research and a demonstrated publication record in peer-reviewed journals and conferences.
  • The successful candidate will be expected to work with and contribute to open-source projects and community-driven initiatives within computational science.
  • Strong verbal and written communication skills for effective collaboration with interdisciplinary teams and clear presentation of complex technical information.
  • Ability to model Argonne’s core values of impact, safety, respect, integrity and teamwork.

Nice To Haves

  • Proficiency in additional programming languages (e.g., C++, Ruby) and metaprogramming techniques.
  • Experience with HPC programming models (MPI, OpenMP, SYCL, CUDA).
  • Experience in writing technical papers and presentations.

Responsibilities

  • Enhance THAPI: Extend and optimize the THAPI profiler ( https://github.com/argonne-lcf/THAPI ) to concurrently profile AI/ML and ModSim components at scale.
  • API & Tracing Integration: Design and implement new tracing API layers to capture fine-grained performance data across diverse runtime environments.
  • The main targets will be tracing additional communication layers (NCCL, libfabric) and Python-based applications (either via Python internals or via the native support of Python libraries such as PyTorch).

What This Job Offers

  • Job Type: Full-time
  • Education Level: Ph.D. or professional degree
  • Number of Employees: 1,001-5,000 employees
