Senior HPC Engineer

Texas A&M University System•College Station, TX

8d•$130,000 - $140,000•Onsite

About The Position

We are making a bold leap into the future of artificial intelligence with a $45 million investment in an NVIDIA DGX SuperPOD. This investment underscores our commitment to meeting the cutting-edge research and supercomputing needs of faculty and staff across all Texas A&M System members. As a Senior High Performance Computing (HPC) Engineer, you will provide technical expertise and consultation for the design and deployment of HPC systems. Get in on the ground floor with a team that is shaping the next generation of innovation. This position is security-sensitive and requires U.S. citizenship.

Requirements

Bachelor’s degree in applicable field or equivalent combination of education and experience
12 years of related experience
Experience with High Performance Computing (HPC) environments
Advanced Linux system administration skills
Familiarity with computer networking concepts and protocols
Experience with container orchestration tools such as Kubernetes
Knowledge of Run:ai for AI workload management
Proficiency with Slurm workload manager
Experience working with NVIDIA DGX systems
Understanding of virtualization technologies
Familiarity with Infrastructure as a Service (IaaS) platforms
Experience with DDN storage solutions
Knowledge of network-attached storage systems
Expertise in scalable supercomputing architectures, interconnects, and storage systems
Proficiency in scripting (Python, Bash, Perl) and scientific computing (MPI, OpenMP, CUDA)
Experience with configuration management tools (Ansible, Puppet)
Familiarity with container technologies (Docker, Singularity, Kubernetes)
Strong troubleshooting, communication, and strategic planning skills
Must be a United States citizen, permanent resident, or a person granted asylum or refugee status in accordance with 15 CFR, Part 762; 22 CFR §§122.5, 123.22 and 123.26; and 31 CFR § 501.601
All positions are security-sensitive.
Applicants are subject to a criminal history investigation, and employment is contingent upon the institution’s verification of credentials and/or other information required by the institution’s procedures, including the completion of the criminal history check.

Responsibilities

Manage large-scale HPC cluster operations, including OS upgrades, firmware patching, and performance tuning.
Oversee networking, security, and infrastructure for HPC systems.
Lead the development of specialized HPC computing clouds and scalable storage systems.
Collaborate with stakeholders to develop service-based solutions.
Serve as a strategic technical resource across departments.
Lead enterprise-wide HPC projects using established project management protocols.
Mentor junior system administrators and enforce performance standards.