NVIDIA has been reinventing computer graphics, PC gaming, and accelerated computing for 30 years. It is a unique legacy of innovation that’s fueled by great technology and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing, an era in which our GPU acts as the brains of computers, generative AI, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work.

We are looking for a highly skilled Principal Software Engineer to design and develop AIOps & Observability platforms at NVIDIA. Internal teams use these platforms to monitor, diagnose, and optimize products and millions of assets and services across cloud, on-prem, data center, supply chain, and edge environments. You will work with a team of engineers, product managers, and partners to define the observability strategy, roadmap, and standard methodologies for NVIDIA. You will also mentor and coach other engineers on observability, machine learning, tools, and techniques.

What you will be doing:

Lead the design, development, and deployment of AIOps & Observability platforms, including metrics, logs, traces, events, alerts, dashboards, and visualizations.

Drive the technical vision and roadmap for AIOps and Observability initiatives, aligning with business goals and industry best practices.

Collaborate with other teams and customers to understand their observability needs and provide solutions that meet their requirements and expectations.

Establish and implement observability standards, guidelines, and processes across NVIDIA.

Research, evaluate, and adopt new observability technologies and frameworks that can enhance user experience.

Provide peer reviews for other engineers, including feedback on performance, scalability, security, and correctness.

Work with data scientists to implement machine learning models for anomaly detection, forecasting, and root cause analysis on logs, metrics, and events.

Handle large volumes of data and ensure data quality, security, and compliance.

Develop and operate scalable, reliable, and distributed systems that can handle high traffic and complex workloads.

Find opportunities to automate remediation of commonly occurring issues to operate systems reliably and efficiently.