Oak Ridge National Laboratory, posted about 2 months ago
Full-time • Mid Level
Oak Ridge, TN
5,001-10,000 employees
Professional, Scientific, and Technical Services

Oak Ridge National Laboratory (ORNL) is seeking highly motivated HPC Linux Systems Engineers to join teams operating some of the most advanced computing environments in the world. This evergreen posting represents multiple potential openings across ORNL's high-performance computing ecosystem. Successful candidates will help architect, deploy, and maintain HPC systems that accelerate discovery across open science, laboratory research, and secure computing missions. Applications are reviewed on a continual basis as new opportunities arise.

  • Install, integrate, and administer Linux-based HPC clusters, storage systems, and high-speed networks.
  • Monitor and optimize system performance, reliability, and scalability for large-scale computational workloads.
  • Diagnose complex hardware and software issues, coordinating with vendors and internal engineering teams to implement solutions.
  • Participate in system design, deployment, acceptance testing, and upgrades for leadership-class and research computing systems.
  • Develop and maintain automation, configuration management, and monitoring solutions using tools such as Ansible, Puppet, Bash, or Python.
  • Collaborate with scientists, researchers, and technical staff to ensure HPC resources effectively support scientific and mission objectives.
  • Support identity management, authentication, and access control frameworks to maintain secure and compliant environments.
  • Document system architectures, processes, and best practices, and contribute to internal knowledge sharing.
  • Participate in on-call rotations and off-hours maintenance windows as required to support 24x7 operations.
  • Bachelor's degree in computer science, engineering, or a related field.
  • A minimum of 5 years of experience in Linux systems administration, or an equivalent combination of education and experience.
  • Experience administering HPC clusters or large-scale Linux computing environments.
  • Familiarity with batch schedulers (e.g., SLURM, PBS, LSF) and parallel file systems (Lustre, GPFS/Spectrum Scale).
  • Experience implementing and managing automation and configuration management frameworks (Ansible, Puppet, Salt).
  • Proficiency in scripting or programming (Python, Bash, Go).
  • Understanding of networking fundamentals and high-speed interconnects (InfiniBand, Ethernet).
  • Experience deploying or supporting identity management and multi-factor authentication systems (PingFederate, RSA SecurID, Entra ID).
  • Familiarity with virtualization or containerization technologies (VMware, KVM, Podman, Apptainer).
  • Experience troubleshooting and tuning high-performance storage, networking, and compute systems.
  • Excellent communication, collaboration, and problem-solving skills.
  • Demonstrated ability to lead or contribute to complex technical projects with minimal supervision.
  • Work on the world's most powerful supercomputers, including Frontier, the first system to achieve exascale performance.
  • Enable breakthrough science in fields like fusion energy, climate modeling, AI, and national security.
  • Collaborate with diverse teams of scientists, engineers, and technologists from across the DOE complex and academia.
  • Grow your career in a mission-driven, innovation-focused environment with access to professional development and leadership opportunities.
  • Enjoy life in East Tennessee, with a thriving research community, scenic outdoor recreation, and a high quality of life.