NVIDIA has become the platform upon which every new AI-powered application is built. From healthcare research to autonomous vehicles and voice-recognition systems, there is a need to simplify AI applications and workflows and make them predictable, and NVIDIA is at the center of this revolution. Resiliency and observability are key to delivering customer value and an exhilarating customer experience.

This product manager will lead the development of foundational tools dedicated to ensuring the resiliency and observability of large-scale accelerated computing platforms. By creating essential tools for system diagnostics, performance monitoring, and automated recovery, they will empower customers to confidently operate both complex AI training and demanding inference workloads with maximum uptime and efficiency.

What you will be doing:
Be a subject-matter expert on resiliency and observability. Deeply understand failure modes across the GPU hardware, network, and software stack, the telemetry signals that reveal them, and how they correlate to workload health and SLOs.
Master modern reliability architectures. Keep up to date with industry trends, and build for everyone who wants to use these tools.
Drive joint project planning. Define concrete achievements, tasks, and work for resiliency and observability initiatives with external partners.
Fuel innovation in reliability tooling. Lead ideation sessions to propose novel approaches and shape new proofs of concept.
Bridge development, SRE, and partner teams. Facilitate clear communication, triage emergent issues rapidly, and keep feedback loops between engineering and customer operations tight.
Coordinate execution across functions. Work with engineering, design, operations, sales, and marketing to embed resiliency and observability requirements into every product launch, capacity expansion, and lifecycle transition.
Job Type
Full-time
Career Level
Senior
Number of Employees
5,001-10,000 employees