About The Position

Do you want to be part of a team that's revolutionizing the field of AI with data-center-scale solutions? We are looking for a hardworking Solution Architect Manager with experience in designing, building, and maintaining large-scale HPC and AI infrastructure to join our team at NVIDIA. As Solution Architects, we are actively helping make AI Factories a reality. Our team helps enable some of the industry's largest Solution Providers, who serve as our trusted partners. We help partners understand and adopt our reference architectures and libraries through training and workshops; we help them develop robust NVIDIA practices and support customer conversations; and we advise them through their most important data center deployments. This is where you come in!

What you'll be doing: You will be responsible for managing a team of infrastructure experts passionate about the delivery and bring-up of NVIDIA-powered AI Factories. The ideal candidate will have excellent interpersonal skills to contribute to a dynamic, customer-focused team. In this role, you will advise and assist partners as they define and implement large-scale AI/HPC projects. Your primary focus will be on understanding the AI workload and how it interacts with other parts of the system, such as networking, storage, deep learning frameworks, and data cleaning tools. You must be passionate about partner success and about driving AI adoption in the enterprise.

Requirements

  • BS/MS/PhD or equivalent experience in Computer Science, Data Science, Electrical/Computer Engineering, Physics, Mathematics, or other Engineering fields.
  • 8+ years of overall work or research experience with Python, C++, or other software development.
  • 4+ years of experience leading a team.
  • Track record of medium- to large-scale AI training and understanding of key libraries used for NLP/LLM/VLA training (NeMo Framework, DeepSpeed, etc.).
  • Experience with integration and deployment of software products in production enterprise environments, and microservices software architecture.
  • Solid understanding of data center infrastructure: servers, storage, networking, cabling, power, cooling, and physical deployment workflows.
  • Experience with software microservices and with the incorporation and delivery of software in production environments.
  • Technical leadership and strong understanding of NVIDIA technologies, and success in working with customers.
  • Excellent verbal and written communication and technical presentation skills in English.

Nice To Haves

  • Understanding of HPC systems: data center design, high-speed interconnects (InfiniBand), and cluster storage and scheduling-related design and/or management experience.
  • Strong coding and debugging skills, and demonstrated expertise in one or more of the following areas: Machine Learning, Deep Learning, Slurm, Docker, Kubernetes, Singularity, MPI, MLOps, LLMOps, Ansible, Terraform, and other high-performance AI cluster solutions.
  • Hands-on experience with HPC clusters, InfiniBand, GPU infrastructure, or hyperscale data center technologies.
  • Experience in AI infrastructure deployment, professional services, or tech vendor post-sales delivery.

Responsibilities

  • Managing and developing a group of infrastructure and HPC specialists
  • Providing guidance and support to partners, helping them successfully deploy and bring up AI Factories
  • Helping our partners employ our best practices and reference architectures and taking your knowledge out to the field
  • Raising critical customer issues early and providing timely advance alerts on those that need further focus

Benefits

  • You will also be eligible for equity and benefits