Site Reliability Engineer

Microsoft•Atlanta, GA

29d•$100,600 - $199,000•Remote

About The Position

We’re looking for a Site Reliability Engineer (SRE) with a zeal for systems engineering and software development and a passion for quality to envision, design, and deliver Office 365 (O365) Enterprise Cloud service offerings.

Team Overview: Within the vast framework of M365 Office Engineering Direct (OED), our SRE team is instrumental to the success of Exchange Online. With the service spanning hundreds of components, our goal is clear: ensure unmatched service availability and continually elevate user satisfaction.

What We Do & Our Impact: Our approach is layered and precise. By implementing proactive engineering solutions, we identify and tackle incidents head-on, keeping disruptions limited. Monitoring, both comprehensive and nuanced, remains our cornerstone, capturing anomalies beyond the scope of conventional systems. Guided by swift diagnostics, we channel our efforts towards automation, efficiently managing the incident lifecycle from detection to resolution. Additionally, with a commitment rooted in understanding our users, we meticulously prioritize and execute Design Change Requests, ensuring Exchange Online's evolution aligns with user expectations.

Artificial Intelligence (AI) & Machine Learning (ML) in Focus: As we look to the horizon, the fusion of AI and ML with our SRE practices promises a transformative era for Exchange Online. We are in the early stages of integrating predictive analytics to anticipate issues before they manifest, allowing us to stay a step ahead. Customized ML models are being developed to intelligently sift through vast data lakes, identifying patterns and correlations previously overlooked. Our journey with AI and ML is not just about enhancement; it's about redefining reliability, precision, and the user experience in the M365 suite.

Location: This is a U.S.-based position. Remote work is possible, but relocation is not provided for the role.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Requirements

Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience.
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include but are not limited to the following specialized security screening: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Nice To Haves

Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience.
2+ years technical experience working with large-scale cloud or distributed systems.

Responsibilities

Leverages technical expertise and telemetry analysis alongside advanced artificial intelligence (AI) and machine learning (ML) algorithms across a range of components and/or features to identify patterns and opportunities to implement configuration and data changes for one or more platforms, systems, or products in production using code, tooling, and automation.
Independently writes code or scripts that automate the performance of scalable operations processes (e.g., monitoring, alerting, deploying products and updates) across components and features of products operating at scale.
Supports ongoing engagements with product engineering teams by participating in code/design reviews, regular meetings, on-call rotations, and incident responses throughout product development and operations cycles.
Responds to incidents during regular on-call rotations by identifying the level of impact, troubleshooting issues, taking appropriate action to mitigate impact, and deploying appropriate fixes to resolve root cause(s). Notifies product teams and owners of major customer-impacting issues and escalates resolution of highly impactful issues affecting multiple components or features to other engineers or engineering teams as needed. Communicates details and resolutions through post-mortem reports and review meetings.
Develops an understanding of how to safely and reliably manage changes in production by using existing tools and automation to enable product engineering teams to implement changes across a defined range of components or features, with direction from other engineers.
Develops alerts and instrumentation across components and features to monitor product capacity, related security risk, and resource demands, and analyzes telemetry data using existing capacity planning models. Draws insights from analyses of capacity and resource data to optimize component and feature code to manage resources and capacity across a limited range of use conditions and system parameters.
Designs, develops, and maintains telemetry pipelines and monitoring tools that detail operations metrics (e.g., availability, reliability, performance, efficiency) of product components and features operating at scale. Independently performs analyses using existing tools and/or models to identify insights and shares them with product engineering teams to directly contribute to improvements in product development and/or operations.
Demonstrates expertise in distributed systems design, interactions between cloud technology layers and components, common dependencies at scale, and the code that defines infrastructures. Can identify and recommend optimal configurations of cloud technology solutions and modify the code base that defines systems or cloud technologies to improve the security, quality, reliability, and operability of supported products with minimal guidance from other engineers.
