Linux/HPC Systems Engineer

COLSA Corporation, Huntsville, AL
Posted 1d ago, Onsite

About The Position

In this role, your daily impact spans the full spectrum of systems engineering. One hour, you might be performing routine lifecycle maintenance, such as patching a fleet of RHEL workstations or managing user identities across a heterogeneous domain, to ensure the baseline stability of our enterprise. The next, you might be diving into the high-performance fabric, debugging a latency spike on an InfiniBand card or fine-tuning a Slurm scheduler to prioritize a mission-critical simulation. You aren't just managing boxes; you are the bridge between raw silicon and national security breakthroughs. Whether it's the methodical "hardening" of a standard server build to meet SAP requirements or the high-adrenaline optimization of a multi-petabyte Lustre filesystem, your work ensures that our researchers never have to wait for the infrastructure to catch up with their imagination. This position is 100% on-site.

Responsibilities

  • Architect & Deploy: Lead the design and lifecycle management of mission-critical Linux workstations, enterprise-grade servers, and high-performance computing (HPC) clusters.
  • Engineer Filesystems: Master the art of data movement. Administer complex local and distributed filesystems (Lustre, GPFS/Spectrum Scale) to ensure extreme-speed access across the fabric.
  • Infrastructure as Code (IaC): Treat the data center as a codebase. Develop sophisticated automation workflows using Python, Bash, and Ansible to eliminate manual toil and ensure drift-free configurations.
  • Defensive Engineering: Implement "Hardened by Design" security. Fine-tune SELinux policies and advanced firewall configurations to protect sensitive data without sacrificing computational performance.
  • Container Orchestration: Modernize scientific workflows by deploying and managing isolated environments using Podman while working to establish a Kubernetes environment.
  • HPC Performance Tuning: Push the limits of the silicon. Optimize cluster scheduling and management utilizing industry-leading tools like Bright Cluster Manager and Slurm.
  • Low-Latency Networking: Configure and optimize high-bandwidth networking, including InfiniBand fabrics, for seamless inter-node communication.
  • Technical Documentation: Author high-fidelity playbooks and strategic architectural diagrams that serve as the blueprint for our evolving infrastructure.