Linux Systems Engineer

Plus3 IT Systems, Charlottesville, VA
Onsite

About The Position

Join Plus3 IT Systems! We are at the forefront of cloud computing, providing comprehensive and cutting-edge solutions across a wide array of critical domains. But we don’t stop at implementing technology; we are trusted advisors, delivering expert analysis to fully understand our clients’ unique challenges and objectives. Our passion is empowering our customers to reach their strategic goals. This mission is fueled by our exceptional teams of innovative technology practitioners, who bring deep technical skills and an unwavering commitment to excellence. At Plus3 IT, we foster agile, collaborative processes, working hand-in-hand with our clients to ensure transparency, flexibility, and ultimately, their success in the cloud.

Requirements

  • Active TS/SCI clearance required
  • Active DoD 8140 IAT Level II certification (e.g., Security+), or the ability to obtain one
  • Bachelor's degree in Computer Science, Information Technology, Engineering, or similar; an additional 4 years of experience will be considered in lieu of a degree
  • Minimum 6 years of Linux systems administration experience in enterprise, research computing, or distributed compute environments
  • Demonstrated experience supporting HPC cluster platforms or distributed compute environments at scale
  • Hands-on experience with workload schedulers, queue management, and job troubleshooting
  • Proficiency in Linux command-line administration, including server configuration and system troubleshooting in distributed environments
  • Ability to work onsite (hybrid and remote options not available)

Nice To Haves

  • Direct administration experience with multi-node HPC cluster environments, including provisioning workflows and lifecycle management
  • Experience with parallel or distributed file systems in a cluster context
  • Familiarity with supporting MPI or OpenMP parallel workloads, and an understanding of how they interact with schedulers and underlying hardware
  • Experience supporting GPU-enabled compute environments and CUDA-based workloads within an HPC cluster
  • Proficiency with configuration management tools such as Ansible or Puppet applied to cluster-scale infrastructure
  • Prior experience supporting systems within DoD, IC, or research laboratory environments

Responsibilities

  • Deploy, configure, and sustain multi-node Linux HPC cluster environments, including node provisioning, integration, and day-to-day operational support
  • Administer and troubleshoot workload scheduling platforms, including queue configuration, job submission workflows, and scheduler performance optimization
  • Support distributed and containerized compute workloads leveraging parallel frameworks and container technologies within the cluster environment
  • Monitor and analyze performance across compute, storage, and network layers, including high-performance networking technologies, and drive resolution of cluster communication issues
  • Support GPU-enabled compute environments and CUDA-based workloads, ensuring proper resource allocation and integration with the scheduling platform
  • Develop and maintain operational scripts and automation tooling (Bash, Python) to improve cluster administration efficiency and reduce manual toil

Benefits

  • Employer-paid health, dental, vision, life, and short- and long-term disability insurance; contribution to a health savings account; 401(k) matching; parental leave; flexible paid vacation; and company-paid holidays.