We are seeking a skilled AI Operations & Infrastructure Engineer to manage and maintain our AI computing platforms. The role covers the entire AI software stack and its tools, containerization technologies, and the networking infrastructure that AI workloads run on. You will manage storage solutions, deploy data processing units (DPUs), and monitor cluster health and resource utilization. The position requires expertise in workload management, efficient power and cooling, and network performance optimization for AI and machine learning computations.

You will also integrate NVIDIA networking products, deploy networking solutions in data centers, and provide technical support to teams managing AI infrastructure. Collaboration with data scientists, researchers, and IT professionals is key, as is leading the deployment and validation of servers and systems for AI-enabled platforms.

Responsibilities include configuring network topologies, BMC, OOB, TPM, power, and cooling, as well as installing, upgrading, and validating GPU-based servers, BlueField DPUs, cables, and transceivers. Firmware upgrades, hardware validation, storage setup, and administration of physical and logical resources are also part of the role. You will install and configure operating systems, cluster software, drivers, containers, and the NGC CLI, and manage clusters using various orchestration tools.

Performing stress, benchmarking, and burn-in tests, verifying system components, and troubleshooting hardware, software, storage, and performance issues are essential. The role also involves replacing faulty components, optimizing systems, and monitoring, documenting, and reporting on cluster health and performance to ensure secure, efficient, and scalable operation of NVIDIA AI infrastructure.
Job Type: Full-time
Career Level: Senior
Education Level: No Education Listed