Senior Solutions Architect, AI Cluster Performance and Telemetry

NVIDIAAustin, TX
$184,000 - $356,500

About The Position

We are looking for a Senior Solutions Architect specializing in Data Center Systems & Performance to join our elite solutions architecture team. In this role, you will work at the intersection of groundbreaking hardware and complex software stacks. As a Solutions Architect, you will act as a pivotal technical expert uniting engineering, field teams, and customers with highly intensive requirements. You will be responsible for analyzing and optimizing the performance of world-class AI, deep learning, and HPC ecosystems.

Requirements

  • BS or MS in Engineering, Electrical Engineering, Physics, or Computer Science (or equivalent experience).
  • 8+ years of work-related experience in the high-tech industry, particularly in system build, performance analysis, and technical customer-facing roles.
  • A strong understanding of how CPUs, GPUs, and high-speed networking fabrics interact within massive clusters.
  • Practical experience with performance counters, profiling tools, and telemetry collection systems (e.g., Perf, eBPF, Prometheus, Grafana).
  • Practical experience working with containers, cloud provisioning, and scheduling tools such as Docker, Docker Swarm, Kubernetes, SLURM, Ansible.
  • Proven track record of transforming raw logs and telemetry into structured time series data, dashboards, and heat maps.
  • The ability to translate complex, low-level technical performance anomalies into clear, actionable narratives for cross-functional teams.
  • Strong collaborative skills and a proven history of building successful relationships across diverse engineering and operations teams.

Nice To Haves

  • Deep knowledge of multi-GPU communication libraries like NCCL, and how they optimize inter-GPU topologies.
  • Deep, hands-on experience working directly with NVIDIA hardware architectures, NVLink, NVSwitch, or NVIDIA Nsight tools.
  • Practical experience optimizing distributed AI training workloads, LLMs, or large-scale high-performance computing environments.
  • Experience developing or integrating Agentic AI frameworks to autonomously parse telemetry logs, diagnose configuration drifts, or automate cluster triage.

Responsibilities

  • Work together with our partners and customers to identify, analyze, and resolve complex performance bottlenecks across interconnected GPU, CPU, and networking systems.
  • Complete and maintain robust performance benchmarking suites to stress-test high-performance clusters and establish performance baselines.
  • Apply industry-standard performance tools to monitor hardware performance counters and extract deep system telemetry.
  • Deeply investigate system and software configurations to find and fix subtle discrepancies that impact peak performance.
  • Partner closely with internal engineering units and outside collaborators and customers to collectively develop solutions and boost infrastructure performance.

Benefits

  • equity
  • benefits
© 2026 Teal Labs, Inc
Privacy PolicyTerms of Service