HPC Technical Consultant, Onsite (LANL) Los Alamos, NM

Hewlett Packard Enterprise, Clovis, NM
$81,500 - $187,500, Onsite

About The Position

Join a dedicated on-site team supporting operations and hardware maintenance for HPE supercomputers in one of the nation’s premier High-Performance Computing facilities. US Citizenship required. Onsite daily work required in Los Alamos, NM. This is not a remote position.

Requirements

  • Ability to obtain a Q Clearance (required)
  • US Citizenship (required)
  • Must be able to work onsite 5 days per week in Los Alamos, NM, plus additional onsite work for on-call support. This is not a remote position
  • Strong mechanical aptitude and comfort using common hand tools (screwdrivers, pliers, wrenches, cable tools, etc.) for assembling, disassembling, and maintaining server hardware and related equipment
  • Ability to lift up to 50 lbs individually and up to 75 lbs with assistance
  • Solid understanding of computer hardware components (servers, drives, memory modules, power supplies, cabling, and peripherals)
  • Proficiency with basic computer operations on Windows and macOS (MacBook), including OS navigation, file management, and standard productivity tools such as Slack, SharePoint, Microsoft Office (Word, Excel, Outlook), and Microsoft Teams

Nice To Haves

  • Associate’s degree, some college, or technical training (BS preferred)
  • 2+ years of Linux system administration experience, including strong command-line navigation, log analysis and monitoring (journalctl, syslog, log files), troubleshooting system and application issues, and scripting/automation using Bash or Python.
  • Experience using Redfish (along with IPMI) for out-of-band server hardware management and monitoring, including using the Redfish RESTful API to query system health, power/thermal monitoring, firmware inventory, component status (processors, memory, drives, NICs), and event logs, and to perform actions such as system resets, power control, and BIOS configuration.
  • 2+ years of hands-on experience troubleshooting and maintaining server hardware in a datacenter environment, including diagnosing hardware faults (power, thermal, storage, networking), performing component replacements (drives, memory, CPUs, PSUs, HBAs, NICs), rack mounting/decommissioning servers, and managing cable infrastructure
  • 1+ year of experience with high-speed networking concepts and troubleshooting for Ethernet, HPE Slingshot, and InfiniBand fabrics, including link diagnostics, performance tuning, cable/fiber management, switch configuration, and fault isolation in large-scale HPC environments.
  • Previous experience in a 24x7 production support environment
  • Strong troubleshooting and problem-solving skills with the ability to work independently, including systematically diagnosing complex hardware, software, and network issues through log analysis, debugging tools, and root cause analysis while minimizing downtime in high-availability environments
  • Experience reading technical diagrams and schematics, and working with ticketing systems
  • Experience with Git for version control of code, scripts, configuration files, and documentation (including cloning, branching, committing, merging, and resolving conflicts)
  • Experience with High-Performance Computing (HPC) systems, clusters, or large-scale AI infrastructure
  • Experience with large-scale storage systems, including installation, configuration, monitoring, and troubleshooting of parallel file systems, enterprise SAN/NAS solutions, object storage, and high-capacity disk arrays.
  • Industry certifications (any of the following): CompTIA Linux+, CompTIA Security+, CompTIA Server+, CompTIA A+, CompTIA Network+, ITIL Foundation

Responsibilities

  • Monitor and maintain system health across large-scale HPC compute, network, and storage infrastructure
  • Troubleshoot and repair hardware issues on HPC servers and supporting systems
  • Perform basic Linux system administration tasks as needed
  • Create, monitor, update, and close support tickets
  • Perform hardware component replacements using spares
  • Operate hand tools and low-power tools for server maintenance
  • Track and document hardware repairs, part replacements, and returns
  • Create, update, and maintain site documentation, processes, and workflows
  • Assist with new system installation and expansion activities
  • Read system documentation and diagrams to locate components
  • Collaborate with team members using email, Teams, Slack, and in-person communication
  • Participate in on-call schedule to support 24x7 operations
  • Maintain tools and workspace in an organized manner

Benefits

  • Health & Wellbeing: a comprehensive suite of benefits that supports your physical, financial, and emotional wellbeing
  • Personal & Professional Development: programs tailored to help you reach your career goals
  • Unconditional Inclusion
  • Flexibility to manage work and personal needs