AT&T Global Public Sector is a trusted provider of secure, IP-enabled, cloud-based network solutions and professional services to the Federal Government. We are dedicated to recruiting, developing and empowering a diverse, high-performing workforce that is passionate about what they do, committed to our shared values and dedicated to our customers’ mission.

The scope of this Contract requires specialized expertise in areas such as high-performance computing (HPC), automated processing systems, distributed software design, and secure hosting and networking solutions. The IT infrastructure consists primarily of Linux, with some Windows and UNIX. The environment includes a variety of network devices, server interconnections, mass storage solutions, and essential supporting infrastructure services. The services provided under this Contract support areas including HPC, infrastructure maintenance for HPC systems, networking, office automation, and the development of specialized software.

AT&T has an opening for a High-Performance Computing (HPC) Systems Administrator to support a large client-based IT enterprise, including installation, configuration and networking of Linux- and Windows-based platforms. This position requires office presence a minimum of 5 days per week and is available only in the location(s) posted. No relocation is offered. Work is performed at a government customer site.

Description of Job Duties/Responsibilities:

The Systems Administrator provides HPC sustainment support across two geographically dispersed sites, including:
Linux-based HPC clusters (e.g., Red Hat/CentOS/Rocky/Ubuntu) with parallel file systems (e.g., Lustre/GPFS) and high-speed interconnects (InfiniBand/Slingshot).
Transition of new systems/capabilities into operations (clusters, SMP/MPP, parallel file systems).
Support to HPC and ABS (ABUNDANTSHIELD) SRE teams in accordance with Government policies and procedures.

Proficient with the following (as the specific position requires):
Operate and maintain systems/services: monitoring, incident response, troubleshooting, and routine maintenance.
Install/configure Linux OS, file systems, and TCP/IP networking; troubleshoot OS and application issues.
Automate/administer via Bash scripting; compile/install software as required.
Use common operations and observability tooling: Jira, Confluence, Grafana, Prometheus, Nagios.
Support HPC workload and configuration management tooling: Slurm, git, Salt, Ansible.
Provide user support and escalation/status communication to agency management and internal customers.
Optimize operations through resource utilization and capacity analysis/planning.
Apply in-depth troubleshooting skills across heterogeneous systems (no single fixed solution).
Provide detailed analysis and feedback to agency management and internal customers for escalated tickets.
Provide support for the dispatch system and hardware problems, and remain involved in the resolution process.
Harden, patch, and tune Linux/UNIX/Windows systems; implement OS-level enhancements to improve reliability and performance.
Support the design of systems, mission architecture and associated hardware.
Possess a working knowledge and understanding of system administration interdependencies as part of the Service-Oriented Architecture (SOA).
Analyze and resolve complex problems associated with server hardware, applications and software integration.
Job Type: Full-time
Career Level: Mid Level
Number of Employees: 5,001-10,000 employees