HPC Linux Storage Engineer

Oak Ridge National Laboratory • Oak Ridge, TN

Hybrid

About The Position

Oak Ridge National Laboratory (ORNL), home to some of the world’s most powerful supercomputers, is seeking highly skilled professionals to support large-scale storage systems, high-speed parallel file systems, and archival solutions critical to advancing scientific discovery and innovation. As part of ORNL’s leadership-class computing ecosystem, you will play a vital role in designing, deploying, optimizing, and maintaining infrastructure that powers cutting-edge research across diverse scientific domains. This evergreen posting represents multiple opportunities across ORNL’s high-performance computing (HPC) environment, supporting scalable, reliable, and secure computing and storage capabilities. Applications are reviewed on an ongoing basis as new positions become available to meet the dynamic needs of our world-class computing facility.

Requirements

Bachelor’s degree in computer science, engineering, information technology, or a related field, plus at least 5 years of professional experience managing Linux/UNIX systems in heterogeneous environments. An equivalent combination of education and experience will be considered.
Demonstrated experience with high-performance computing (HPC) storage systems and enterprise storage platforms (e.g., Lustre, GPFS, BeeGFS, or WEKA).
Proficiency in scripting languages (e.g., Python, Bash, Perl) and in configuration management, automation, and version control tools (e.g., Ansible, Puppet, Git).
Strong communication, collaboration, and problem-solving skills with the ability to design and implement solutions independently.

Nice To Haves

Active DOE Q, DoD Top Secret, or TS/SCI clearance.
Hands-on experience with HPC cluster technologies, including job schedulers (e.g., Slurm) and system deployment tools (e.g., Warewulf, PXE boot, Bright Cluster Manager).
Expertise in high-performance parallel file systems, tape library systems, and storage and storage networking technologies (e.g., RAID, ZFS, NVMe-oF, InfiniBand).
Familiarity with performance monitoring tools (e.g., Grafana, Nagios), benchmarking systems, and I/O optimization techniques.
Experience with virtualization and containerization platforms (e.g., VMware, KVM, Podman, Apptainer).
Background in open-source development, including submitting patches upstream and building custom Linux packages (e.g., RPMs for RHEL).
Demonstrated ability to troubleshoot and optimize high-performance storage, compute, and networking systems in HPC environments.
Experience documenting technical processes and contributing to complex technical projects in government, scientific, or highly technical settings.

Responsibilities

Architect, deploy, and manage large-scale storage systems and HPC platforms to support research, scientific, and enterprise workloads.
Develop and implement solutions for structured, unstructured, and archival data storage, focusing on scalability, reliability, and performance.
Apply systems analysis techniques to consult with users/customers, determine functional requirements, and design, test, or optimize storage and computational solutions tailored to their needs.
Develop, document, and modify solutions, including system prototypes and automated workflows, to enhance operational efficiency.
Ensure the performance, availability, scalability, and security of diverse infrastructure environments.
Diagnose and resolve complex operational challenges quickly and effectively, applying advanced performance optimization techniques for a wide range of workloads.
Work closely with stakeholders from research, technical, and operational teams to understand workflows, identify opportunities for improvement, and deliver effective solutions.
Define, implement, and enforce best practices, standards, and procedures across projects and teams.
Automate system configuration, provisioning, monitoring, and maintenance to reduce manual effort and downtime.
Evaluate emerging technologies and tools to continuously improve system capabilities, adapt to changing needs, and plan for future advancements.
Support critical infrastructure through participation in a 24/7 on-call rotation and off-hours maintenance windows.
Resolve hardware and software issues in coordination with vendors, ensuring minimal impact on operations.