Modern supercomputers such as Argonne's Aurora exascale machine are large and complex, comprising so many components that even a small failure rate per component translates into an appreciable failure rate for the machine as a whole. Making applications resilient to such unavoidable failures requires characterizing the application failure rate with good precision and, if possible, with some specificity regarding the type and hardware usage pattern of the applications. This requires combining various streams of logging information and performing nontrivial statistical analysis, possibly supplemented by machine learning techniques. This project targets the study of application failure rates on Aurora based on such analyses. The student will explore the log data streams to discover effective ways of tracking application failure rates; write code to analyze, categorize, and visualize application interrupts; and help develop models of machine computational efficiency.
Job Type: Full-time
Career Level: Intern
Education Level: No Education Listed