Are you a customer-obsessed, AI-curious problem-solver who thrives in an inclusive, collaborative global team? Join Engineering Operations (EngOps), the organization driving operational excellence across the Microsoft Cloud to strengthen quality, reliability, security, and customer trust. As part of EngOps, you’ll design solutions that prevent issues before they happen, embed AI-powered automation, and turn signals into actions that deliver measurable customer impact. Our culture of empowerment, inclusion, and growth mindset defines how we work.

The Customer Reliability Engineering (CRE) team within Azure CXP is a top-level pillar of Azure Engineering. It is responsible for world-class live-site management, customer reliability engagements, and modern customer-first experiences at scale, and it drives deep customer insights and empathy into the broader Azure Engineering organization. Our “no dead ends” philosophy ensures that every customer, regardless of size or scale, can realize their full potential through the Microsoft Cloud.

We are seeking decisive and experienced Service Engineers to handle live-site issues, lead problem management, and drive customer reliability. This role is accountable for enhancing the customer experience across Azure, including First Party Services. The ideal candidate will demonstrate strong breadth in managing complex, highly available services, paired with deep technical expertise in Azure Core Services and their interdependencies. You will work closely with Customers, First Parties, Customer Support, Livesite, and Engineering teams to deliver critical, customer-facing features. Success in this role requires the ability to influence and collaborate across many Azure servicing teams to ensure customer needs are met.

In addition, this role includes on-call responsibilities for managing and resolving complex multi-service outages. It requires the ability to remain effective under pressure, apply broad technical and analytical skills, and coordinate seamlessly with internal service teams and stakeholders. Strong communication skills, both written and verbal, are essential. You’ll be the single point of command and control during high-severity incidents, orchestrating cross-functional engineering, operations, and communications to minimize impact, restore services quickly, and protect the trust of our global customer base.

You will also lead the evolution of Azure's Incident Management practice through Post-Incident Reviews, process development, and system automation. By leveraging telemetry and metrics, you will identify and drive platform-wide improvements with global impact. This role offers a unique opportunity to make an immediate impact and improve systems at scale.
Job Type
Full-time
Career Level
Mid Level
Number of Employees
5,001-10,000 employees