Systems Analyst (/Site Reliability Engineer)

Hewlett Packard Enterprise
79d$115,500 - $266,000

About The Position

We are seeking a skilled Systems Analyst (/Site Reliability Engineer) at HPE to support Oak Ridge National Laboratory (ORNL). This is a unique, on site, customer facing opportunity to work with some of the world's most advanced high-performance computing (HPC) systems, including Frontier, the world’s first exascale supercomputer. As part of our team, you will play a critical role in the deployment, maintenance, and optimization of large-scale computing software infrastructure and hardware, ensuring system reliability for cutting-edge scientific research.

Requirements

  • Due to the nature of the work, this position requires either U.S. Citizenship or U.S. Lawful Permanent Resident (LPR) status.
  • Bachelor’s in Computer Science, Computer Engineering, or a related field, with at least 2 years of experience, OR a Master’s in Computer Science or Computer Engineering of a related field.
  • HPC System Experience: Experience using SLURM-based HPC systems, both as a user and preferably as a system administrator.
  • Technical Skills: Proficient in Linux, Python, and Bash scripting. Familiarity with C++/Fortran-based HPC application development, GPUs, MPI, and high-performance computing tools.
  • Application Build and Configuration Knowledge: Strong understanding of application build processes, including compiler configurations, library integration, and dependency management, to effectively set up environments, perform upgrades, and troubleshoot build and runtime issues.
  • Log analysis: Experience in large-scale log analysis and troubleshooting performance, bugs or system failures.
  • Communication Skills: Strong written and verbal communication skills, with the ability to document and share knowledge effectively with internal teams and end-users.
  • Industry Knowledge: Familiarity with emerging HPC trends, system architectures, and optimization strategies.

Nice To Haves

  • Accountability
  • Active Learning
  • Active Listening
  • Bias
  • Business Growth
  • Client Expectations Management
  • Coaching
  • Creativity
  • Critical Thinking
  • Cross-Functional Teamwork
  • Customer Centric Solutions
  • Customer Relationship Management (CRM)
  • Design Thinking
  • Empathy
  • Follow-Through
  • Growth Mindset
  • Information Technology (IT) Infrastructure
  • Infrastructure as a Service (IaaS)
  • Intellectual Curiosity
  • Long Term Planning
  • Managing Ambiguity
  • Process Improvements
  • Product Services
  • Relationship Building

Responsibilities

  • Maintain and optimize compute infrastructure across multiple large-scale HPC systems.
  • Participate in the deployment, testing, and validation of live high-performance computing clusters.
  • Troubleshoot node failures by analyzing OS internals, compiler behavior, and system logs, coordinating with internal subject-matter experts as needed.
  • Conduct routine and on-demand maintenance, troubleshooting, and performance tuning for large-scale HPC environments.
  • Collaborate with researchers, engineers, and technical staff to open, maintain and close JIRA tickets to ensure system reliability and efficiency for high-stakes, high-performance scientific research.
  • Investigate and document complex software and system-level issues, acting as a bridge between users and HPE internal teams.
  • Develop and implement automation tools, scripts, and monitoring solutions to streamline system management.
  • Stay up-to-date with advancements in HPC technologies, including GPU acceleration (e.g., ROCm), parallel computation (Cray PE, MPI/OpenMP), and performance tuning.

Benefits

  • Health & Wellbeing: Comprehensive suite of benefits that supports physical, financial and emotional wellbeing.
  • Personal & Professional Development: Programs catered to helping you reach career goals.
  • Unconditional Inclusion: A culture that celebrates individual uniqueness and values varied backgrounds.
© 2024 Teal Labs, Inc
Privacy PolicyTerms of Service