Senior HPC Linux Systems Engineer

Xcel Engineering
Oak Ridge, TN
2d

About The Position

XCEL Engineering is seeking a qualified applicant for a Senior HPC Linux Systems Engineer position supporting the National Center for Computational Sciences (NCCS) at Oak Ridge National Laboratory (ORNL). NCCS, which hosts several of the world's most powerful computer systems, needs a highly qualified individual to play a key role in improving the security, performance, and reliability of its computing environments. This includes supporting Frontier, one of the fastest supercomputers in the world, along with numerous commodity clusters and specialized programs and partnerships. Frontier is one of the scientific research community's most powerful computational instruments for exploring solutions to some of today's most challenging problems.

Requirements

  • Bachelor's Degree in a scientific or technical field
  • 8+ years of Linux systems experience
  • An equivalent combination of education and experience will be considered

Nice To Haves

  • Experience managing Linux operating systems in a large-scale system environment
  • Solid understanding of networked computing environment concepts
  • Experience with Linux Cluster Administration
  • Ability to develop and maintain programs and scripts that aid in the operation and automation of administrative tasks using various shell and scripting languages (bash, Python, Go)
  • Experience with Lustre and GPFS file systems
  • Experience with batch schedulers (particularly SLURM)
  • Experience deploying and maintaining automated configuration management software such as Puppet
  • Strong interpersonal and communication skills
  • Ability to work as a team player
  • Proactive and solution-oriented problem solver
  • Prior project and/or team leadership experience

Responsibilities

  • Install, integrate, and administer HPC Linux clusters and high-speed networks
  • Diagnose system operational problems quickly and effectively
  • Coordinate with vendors to resolve hardware and software problems
  • Recommend, plan, and coordinate hardware and software changes with customer participation using change management processes
  • Port and write system management tools
  • Document system administration procedures for routine and complex tasks
  • Participate in a 24-hour, 7-day on-call support rotation and off-hours maintenance windows
  • Implement and integrate systems into the NCCS environment and support systems performance
  • Lead system deployment, integration, and troubleshooting of a large-scale computer
  • Engage with the internal and external community of peers on relevant systems topics, contributing experiences and solutions
  • Mentor junior-level staff as they join
  • Deliver ORNL's mission by aligning behaviors, priorities, and interactions with our core values of Impact, Integrity, Teamwork, Safety, and Service.