NVIDIA • Posted 4 months ago
$184,000 - $287,500/Yr
Full-time • Senior
Santa Clara, CA
Computer and Electronic Product Manufacturing

NVIDIA is looking for a Field Escalation Solution Architect with experience in validation and debugging of large-scale GPU clusters focused on performance. As part of the Solution Architecture organization, we work with the most sophisticated computing hardware and software, driving the latest deep learning and machine learning breakthroughs with NVIDIA's enterprise customers. This role offers an excellent opportunity to build your career in the rapidly growing field of deep learning while enabling the world's most successful technology companies.

  • Validate and debug customer cluster performance issues and functional bottlenecks.
  • Drive customer technical engagements around NVIDIA products and technologies.
  • Help architect and scale high-performance, distributed AI infrastructure on-prem or in the cloud.
  • Address and resolve problems from the bare metal level to the application level.
  • Share knowledge with different teams by delivering demos, assisting with proof-of-concepts, and writing papers and developer blogs.
  • Collaborate with executives and engineering to address sophisticated problems.
  • Work directly with developers and hardware architects to debug cluster performance issues.
  • Provide additional expertise to enable the account team to be more adaptable to the customer.
  • Build custom product demonstrations and POCs for solutions that address critical business needs.
  • BS, MS, or PhD in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or another engineering field, or equivalent experience.
  • 8+ years of work-related experience in NVIDIA and/or accelerated computing technologies.
  • Platform level understanding of server architecture, PCIe topology, GPUs, NICs, Linux OS and kernel drivers.
  • Networking experience, including knowledge of Ethernet, InfiniBand or other networking protocols.
  • Experience working with DevOps on-prem or in cloud environments, including Docker/Containers, cloud APIs, IaaS and Data Center deployments.
  • SLURM, Kubernetes, and/or other job scheduler use, deployment, and debugging skills.
  • Deep understanding of dense data center design, including computing, storage, networking, cloud APIs, and IaaS.
  • Effective time management and the ability to balance multiple tasks.
  • Strong analytical and problem-solving skills.
  • Strong communication skills, both written and verbal.
  • Demonstrated experience with the NVIDIA Collective Communications Library (NCCL).
  • Excellent customer-facing skills and background.
  • Platform design engineering, coding, and proficient debugging skills, including experience with C/C++, the Linux kernel, virtualization and drivers, and profilers/performance analysis tools (e.g., Nsight Systems, nsys).
  • Familiarity with NVIDIA systems/SDKs (e.g., CUDA), NVIDIA Networking technologies (e.g., RoCE, InfiniBand), Switch interconnects and/or ARM CPU solutions.
  • Understanding of deep learning and machine learning frameworks (TensorFlow or PyTorch), LLMs, MLOps, DevOps, and workflows applying cloud technologies.
  • Highly competitive salaries.
  • Comprehensive benefits package.
  • Equity eligibility.
  • Excellent engineering culture.