Site Reliability Engineer

Microsoft • Redmond, WA

About The Position

Microsoft is a company where passionate innovators come to collaborate, envision what can be and take their careers further, offering more possibilities, innovation, and openness in a cloud-enabled world. Microsoft’s Azure Data engineering team is seeking a Site Reliability Engineer.

This team is at the forefront of analytics transformation with products like databases, data integration, big data analytics, messaging & real-time analytics, and business intelligence, including Microsoft Fabric, Azure SQL DB, Azure Cosmos DB, Azure PostgreSQL, Azure Data Factory, Azure Synapse Analytics, Azure Service Bus, Azure Event Grid, and Power BI. Their mission is to build the data platform for the age of AI, powering data-first applications and fostering a data culture. Within Azure Data, the Microsoft Fabric platform team builds and maintains the operating system, providing customers a unified data stack for their entire data estate, offering a unified experience, governance, business model, and architecture.

The Site Reliability Engineering (SRE) team ensures the reliability, scalability, and performance of systems and services by integrating software engineering with IT operations. They automate processes, manage incidents, and enhance system resilience, acting as a bridge between development and operations to maintain highly reliable and efficient systems while enabling fast and seamless software delivery.

Microsoft values diversity and different perspectives to better serve customers and empowers every person and organization to achieve more through a growth mindset, innovation, and collaboration, upholding values of respect, integrity, and accountability to create an inclusive culture.

Requirements

Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role.
This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Nice To Haves

5+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration.
4+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR Master's Degree in Computer Science, Information Technology, or related field.
2+ years’ experience with scripting languages such as PowerShell or Python.
Experience writing code to automate day-to-day tasks.

Responsibilities

Work with all aspects of a high-throughput, multi-tenant service.
Collaborate effectively within the team and with partner teams across Microsoft.
Be part of the on-call rotation for maintaining service health.
Design, implement, and refine chosen solutions in close partnership with Product Management and partner teams.
Champion operational excellence via established metrics, process governance, and policy controls for regular assessment and improvement.
Document and define existing data engineering processes, data and technology, while evaluating them for optimization.
Ensuring high availability of services (System Reliability & Uptime)
Detecting, responding to, and mitigating system failures (Incident Management)
Tracking system health and resolving bottlenecks (Performance Monitoring)
Reducing manual work through scripts and automation (Automation & Tooling)
Scaling infrastructure efficiently to handle demand (Capacity Planning)
Analyzing failures to prevent recurrence (Postmortems & Continuous Improvement)
Embody our culture and values

Benefits

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay
