AT&T Global Public Sector is a trusted provider of secure, IP-enabled, cloud-based network solutions and professional services to the Federal Government. We are dedicated to recruiting, developing, and empowering a diverse, high-performing workforce that is passionate about its work, committed to our shared values, and dedicated to our customers’ mission.
The scope of this Contract requires specialized expertise in areas such as high-performance computing (HPC), automated processing systems, distributed software design, and secure hosting and networking solutions. The IT infrastructure consists primarily of Linux, with some Windows and UNIX. The environment includes a variety of network devices, server interconnections, mass storage solutions, and essential supporting infrastructure services. The services provided under this Contract support areas including HPC, infrastructure maintenance for HPC systems, networking, office automation, and the development of specialized software.
AT&T has an opening for a High-Performance Computing (HPC) Systems Administrator to support the installation, configuration, and networking of Linux-based platforms for a large client-based IT enterprise.
This position requires office presence a minimum of 5 days per week and is available only in the location(s) posted. No relocation is offered. Work is performed at the government customer location.
Description of Job Duties/Responsibilities:
The Systems Administrator provides HPC sustainment support across two geographically dispersed sites, including:
Linux-based HPC clusters (e.g., Red Hat/CentOS/Rocky/Ubuntu) with parallel file systems (e.g., Lustre/GPFS) and high-speed interconnects (InfiniBand/Slingshot).
Transition of new systems/capabilities into operations (clusters, SMP/MPP, parallel file systems).
Support to HPC and ABS (ABUNDANTSHIELD) SRE teams in accordance with Government policies and procedures.
Proficient with the following (as specific position requires):
Operate and maintain systems/services: monitoring, incident response, troubleshooting, and routine maintenance.
Install/configure Linux OS, file systems, and TCP/IP networking; troubleshoot OS and application issues.
Automate/administer via BASH scripting; compile/install software as required.
Use common operations and observability tooling: Jira, Confluence, Grafana, Prometheus, Nagios.
Support HPC workload and configuration management tooling: Slurm, git, Salt, Ansible.
Provide user support and escalation/status communication to agency management and internal customers.
Optimize operations through resource utilization and capacity analysis/planning.
Apply strong troubleshooting skills across heterogeneous systems (no single fixed solution).
Job Type: Full-time
Career Level: Mid Level
Number of Employees: 5,001-10,000 employees