Linux Systems Engineer

Plus3 IT Systems•Charlottesville, VA

23d•Onsite

About The Position

Join Plus3 IT Systems! We are at the forefront of cloud computing, providing comprehensive, cutting-edge solutions across a wide array of critical domains. But we don't stop at implementing technology; we are trusted advisors, delivering expert analysis to fully understand our clients' unique challenges and objectives. Our passion is empowering our customers to reach their strategic goals. This mission is fueled by our exceptional teams of innovative technology practitioners, who bring deep technical skills and an unwavering commitment to excellence. At Plus3 IT, we foster agile, collaborative processes, working hand-in-hand with our clients to ensure transparency, flexibility, and, ultimately, their success in the cloud.

Requirements

Active TS/SCI clearance required
Active DoD 8140 IAT Level II certification (e.g., Security+), or the ability to obtain one
Bachelor's degree in Computer Science, Information Technology, Engineering, or a similar field; an additional 4 years of experience will be considered in lieu of a degree
Minimum 6 years of Linux systems administration experience in enterprise, research computing, or distributed compute environments
Demonstrated experience supporting HPC cluster platforms or distributed compute environments at scale
Hands-on experience with workload schedulers, queue management and job troubleshooting
Proficiency in Linux command-line administration, including server configuration and system troubleshooting in distributed environments
Ability to work onsite (hybrid and remote options not available)

Nice To Haves

Direct administration experience with multi-node HPC cluster environments, including provisioning workflows and lifecycle management
Experience with parallel or distributed file systems in a cluster context
Familiarity with supporting MPI or OpenMP parallel workloads and an understanding of how they interact with schedulers and underlying hardware
Experience supporting GPU-enabled compute environments and CUDA-based workloads within an HPC cluster
Proficiency with configuration management tools such as Ansible or Puppet applied to cluster-scale infrastructure
Prior experience supporting systems within DoD, IC, or research laboratory environments

Responsibilities

Deploy, configure, and sustain multi-node Linux HPC cluster environments, including node provisioning, integration, and day-to-day operational support
Administer and troubleshoot workload scheduling platforms, including queue configuration, job submission workflows, and scheduler performance optimization
Support distributed and containerized compute workloads leveraging parallel frameworks and container technologies within the cluster environment
Monitor and analyze performance across compute, storage, and network layers, including high-performance networking technologies, and drive resolution of cluster communication issues
Support GPU-enabled compute environments and CUDA-based workloads, ensuring proper resource allocation and integration with the scheduling platform
Develop and maintain operational scripts and automation tooling (Bash, Python) to improve cluster administration efficiency and reduce manual toil

Benefits

Employer-paid health, dental, vision, life, and short- and long-term disability insurance; contribution to a health savings account; 401(k) matching; parental leave; flexible paid vacation; and company-paid holidays.
