In this role, you’ll make an impact in the following ways:
- Be hands-on with enterprise-grade NVIDIA AI infrastructure, supporting GPU-based compute, high-performance storage, and network systems designed for ML/AI at scale.
- Deploy, monitor, and troubleshoot containerized AI workloads using Kubernetes, Docker, and GPU orchestration tools such as Run:AI and NVIDIA BCM.
- Own the observability of our AI platforms: monitor health, identify performance bottlenecks, and make strategic recommendations to drive platform reliability and maturity.
- Automate infrastructure operations and provisioning using Python, Bash, and tools such as Terraform or Ansible to reduce manual toil and accelerate experimentation.
- Maintain and scale AI training and inference pipelines, integrating infrastructure workflows into CI/CD systems to enable seamless, automated deployment of AI workloads.
Job Type: Full-time
Career Level: Mid Level
Number of Employees: 251-500 employees