AI Operations & Infrastructure Engineer

Invictus International Consulting, LLC • Fort Meade, MD

Onsite

About The Position

We are seeking a skilled AI Operations & Infrastructure Engineer to manage and maintain our AI computing platforms. This role involves overseeing the entire AI software stack and tools, implementing containerization technologies, and configuring networking infrastructure for AI workloads. You will be responsible for managing storage solutions, deploying data processing units (DPUs), and monitoring cluster health and resource utilization. The position requires expertise in workload management, ensuring efficient power and cooling, and optimizing network performance for AI and machine learning computations.

You will also integrate NVIDIA networking products, deploy networking solutions in data centers, and provide technical support to teams managing AI infrastructure. Collaboration with data scientists, researchers, and IT professionals is key, as is leading the deployment and validation of servers and systems for AI-enabled platforms.

Responsibilities include configuring network topologies, BMC, OOB, TPM, power, and cooling, as well as installing, upgrading, and validating GPU-based servers, BlueField DPUs, cables, and transceivers. Firmware upgrades, hardware validation, storage setup, and administration of physical and logical resources are also part of the role. You will install and configure operating systems, cluster software, drivers, containers, and NGC CLI, and manage clusters using various orchestration tools. Performing stress, benchmarking, and burn-in tests, verifying system components, and troubleshooting hardware, software, storage, and performance issues are essential. The role also involves replacing faulty components, optimizing systems, and monitoring, documenting, and reporting on cluster health and performance to ensure secure, efficient, and scalable operation of NVIDIA AI infrastructure.

Requirements

Qualified candidates must hold an active NVIDIA Professional Certification in AI Networking, AI Infrastructure, or AI Operations
Prior direct, hands-on professional experience administering NVIDIA GPU and data processing unit (DPU) technologies, AI software stacks, and data center environments for high-performance AI workloads
Comprehensive expertise in deploying and maintaining AI compute platforms, including proficiency in containerization and workload orchestration using Docker, Kubernetes, Slurm, NVIDIA Base Command Manager, and Run:Ai
Ability to configure physical and logical resources, including Multi-Instance GPU (MIG) partitioning and BlueField platforms, while overseeing critical facility elements such as power, cooling, and storage solutions
Demonstrated advanced skills in AI networking, specifically configuring and optimizing high-performance InfiniBand and Ethernet fabrics to ensure maximum throughput and minimal latency
Active TS/SCI clearance with a CI Polygraph

Responsibilities

Manage and maintain AI computing platforms, including GPUs and other specialized hardware
Install and configure GPU drivers and software
Oversee the AI software stack and tools
Implement and manage containerization technologies like Docker and Kubernetes
Configure and optimize networking infrastructure for AI workloads, including InfiniBand and Ethernet
Manage storage solutions for AI data, considering performance and capacity requirements
Deploy and manage data processing units (DPUs) to accelerate data center workloads
Monitor and manage AI cluster health and resource utilization
Implement workload management and scheduling tools like Slurm and Kubernetes
Ensure efficient power and cooling for AI infrastructure to maintain optimal operating conditions
Configure high-performance networking solutions for AI and machine learning workloads
Optimize network performance to ensure maximum throughput and minimal latency for AI computations
Implement and fine-tune network protocols to enhance data transfer speeds and efficiency
Integrate NVIDIA networking products with existing AI infrastructure, including servers, GPUs, and storage systems
Deploy networking solutions in data centers to ensure seamless connectivity between AI components
Diagnose and resolve networking issues impacting AI workloads to maintain optimal system performance
Provide technical support and guidance to teams managing AI infrastructure
Collaborate with data scientists, researchers, and IT professionals to understand networking requirements and challenges
Lead deployment and validation of servers and systems for AI-enabled platforms
Configure and manage network topologies, BMC, OOB, TPM, power, and cooling
Install, upgrade, and validate GPU-based servers, BlueField DPUs, cables, and transceivers
Perform firmware upgrades, hardware validation, and storage setup
Configure and administer physical and logical resources, including MIG partitioning and BlueField platforms
Install and configure operating systems, cluster software, drivers, containers (Docker), and NGC CLI
Manage and orchestrate clusters using NVIDIA Base Command Manager, Slurm, Pyxis, Enroot, and Run:Ai
Perform stress, benchmarking, and burn-in tests using HPL, NCCL, NVIDIA NeMo, and ClusterKit
Verify cabling, firmware/software versions, and network signal quality
Troubleshoot and resolve hardware, software, storage, and performance faults
Replace faulty components and optimize systems for AMD/Intel platforms
Monitor, document, and report on cluster health, resource usage, and job performance
Ensure secure, efficient, and scalable operation of NVIDIA AI infrastructure, including user access and workload management