HPC System Software Engineer

Lawrence Berkeley National Laboratory • Berkeley, CA

Hybrid

About The Position

Lawrence Berkeley National Laboratory is hiring an HPC System Software Engineer within the National Energy Research Scientific Computing Center (NERSC) division. In this role, you will be pivotal in architecting, developing, deploying, and supporting the software that forms the backbone of NERSC's world-class supercomputing infrastructure. Your primary role will be to engineer robust, scalable, dynamic, and automated solutions for high-performance computing (HPC) system management and large-scale monitoring, directly enabling the operation of NERSC's flagship systems, including the current Perlmutter supercomputer and the upcoming Doudna system.

You will join a collaborative environment, working with engineers at NERSC, other national laboratories, leading HPC vendors, and vibrant open-source communities. This is a unique opportunity to build the foundational software that powers world-class scientific research and to define the future of programmable, data-driven HPC data centers, as well as the American Science Cloud.

The selected candidate(s) will be hired at the Computer Systems Engineer 3 or 4 (CSE3 or CSE4) level, depending on their skills and experience.

You Will (Level 3):

Develop and maintain software for automated provisioning, configuration management, and orchestration across thousands of servers, with a focus on the OpenCHAMI system management software stack.
Contribute to the development and operation of NERSC's large-scale data center monitoring framework.
Analyze system telemetry and logs to debug complex, system-wide issues and identify performance bottlenecks.
Develop and maintain plugins for the Slurm workload manager.
Identify and automate operational tasks and system management processes to improve the efficiency, reliability, and scalability of HPC systems.
Participate in the full lifecycle of HPC systems, including installation, configuration, testing, operation, and maintenance.
Contribute to a shared on-call rotation to provide 24x7 support for critical HPC systems and infrastructure.
Take ownership of new technical assignments, determine appropriate methods and procedures, and coordinate the activities of other personnel on smaller projects or focused technical efforts.
Collaborate with vendors to troubleshoot bugs, provide feedback on technical requirements, and track the resolution of issues affecting NERSC HPC and monitoring systems.
Evaluate and test new technologies, software, and system architectures to inform future designs.
Contribute code and engage with open-source communities that are critical to the HPC ecosystem, representing NERSC's technical interests.
Work on and resolve complex issues where analysis of situations or data requires an in-depth evaluation of variable factors.

In Addition to Above, You Will (Level 4):

Design major software components for system management and monitoring, creating long-term roadmaps to ensure scalability, reliability, and future-readiness.
Lead the technical vision for key areas of the system software stack, making critical design decisions that impact the entire HPC ecosystem.
Proactively identify, evaluate, and champion emerging technologies and architectural patterns that can significantly enhance NERSC’s capabilities, performance, and operational efficiency.
Solve the most significant and ambiguous technical issues, often requiring cross-functional team collaboration and an in-depth, multi-faceted analysis of complex systems.
Lead the implementation and deployment of critical system improvements, taking full ownership of projects from conception and requirements gathering through to production operation and support.
Provide technical leadership and mentorship to team members and colleagues across NERSC, guiding best practices in software design, development, security, and operations.
Act as a primary technical liaison with HPC vendors and partner institutions, driving the co-development of features and solutions that meet NERSC's strategic needs.
Represent NERSC in national and international forums, technical working groups, and open-source communities, influencing the direction of future HPC technologies to benefit the scientific community.
Determine methods and procedures on new or complex assignments, and formally coordinate the activities of other engineers to achieve project goals.
Work on and resolve significant and unique issues where analysis of situations or data requires an evaluation of intangibles.

Requirements

Level 3:

Typically requires a minimum of 8 years of related experience with a Bachelor’s degree; or 6 years and a Master’s degree; or equivalent experience.
Minimum of 4 years of experience with systems programming in Linux environments or management of large-scale Linux-based systems in a high-performance computing, cloud computing, or hyper-scale environment.
Experience with some or all of our key technologies: containers (such as Docker or Kubernetes); configuration management (such as Ansible or Puppet); monitoring and observability (such as VictoriaMetrics, Prometheus, or Nagios); virtualization (such as Proxmox or Harvester); git-based CI/CD pipelines (such as GitLab runners or GitHub Actions); continuous delivery tools (such as Argo CD or Flux); modern programming languages (such as Go or Rust); and complex scripting with tools such as Python 3 or bash.
Familiarity with provisioning tools (such as Chef, Foreman, or Terraform).
Working knowledge of software engineering best practices for performance and security.
Demonstrated experience in resolving complex issues in creative and effective ways.
Excellent oral and written communication skills.
Demonstrated ability to work effectively as part of a cross-disciplinary team.

In Addition to Above, Level 4:

Typically requires a minimum of 12 years of related experience with a Bachelor’s degree; or 8 years and a Master’s degree; or equivalent experience.
Experience leading and coordinating complex software projects.
Experience with software lifecycle management, from planning through retirement.
Strong Linux systems programming skills and knowledge of Linux system internals.
Demonstrated experience in working on and resolving significant and unique issues where analysis of situations or data requires an evaluation of intangibles.
Ability to exercise independent judgment in methods, techniques, and evaluation criteria for obtaining results.

Benefits

Exceptional health and retirement benefits, including pension or 401(k)-style plans
Opportunities to grow in your career - check out our Tuition Assistance Program
A culture where you’ll belong - we are invested in our teams!
In addition to accruing vacation and sick time, we also have a Winter Holiday Shutdown every year.
Parental bonding leave (for both mothers and fathers)
Pet insurance
