The Microsoft Azure High Performance Computing & AI Engineering (HPC & AI Eng) team is responsible for managing the core platform and fleet of AI High Performance Computing products that customers use to run their most performant and demanding workloads. The AI Customer Experience (AICE) engineering team within the HPC & AI Eng team is on the front lines, managing the flagship supercomputers used by top-tier AI customers. These systems enable breakthroughs such as ChatGPT and are highlighted in the Top500, MLPerf and Graph500 rankings.

We run lean, obsess about customer experience and use an evidence-based approach to decision making. We have a live-site-first, metrics-driven culture that keeps us from accumulating debt and from having to put out fires on a daily basis. You will be in a position that carries a great deal of responsibility and provides opportunities to directly impact customer satisfaction.

As a Senior Software Engineer, you will design and develop the capabilities needed to monitor and efficiently operate the infrastructure and fleet of supercomputers at scale. You will diagnose and troubleshoot the largest-scale supercomputing systems across the infrastructure stack (GPU hardware, networking, datacenter and core software). So that the team is the first to know of critical incidents impacting customer capacity, you will create end-to-end data pipelines that process and synthesize large volumes of telemetry, log files and other data sources into actionable alerts.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Job Type
Full-time
Career Level
Mid Level
Number of Employees
5,001-10,000 employees