Site Reliability Engineer - CTJ - Secret

Microsoft

Posted 1d ago • $102,100 - $219,200

About The Position

Microsoft Substrate is the foundational cloud platform that powers many of Microsoft’s most critical services, including Exchange Online and M365 Copilot, providing shared infrastructure, identity, messaging, storage, and service-to-service capabilities used across Microsoft 365 and related cloud offerings. Substrate services operate at global scale and are designed to deliver high availability, reliability, and security for some of the world’s most demanding workloads.

As a Site Reliability Engineer II, you will take ownership of reliability and operational outcomes for specific components or services. You will independently diagnose and resolve production issues, design and implement automation to reduce toil, and contribute to service improvements that enhance availability, scalability, and efficiency. This role requires deeper technical judgment, stronger software engineering fundamentals, and close collaboration with partner teams to ensure reliability, diagnosability, security, and compliance are built into services from design through operation, particularly for services operating in highly regulated environments.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Requirements

Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience.
4+ years technical experience in software engineering, network engineering, or systems administration.
Candidates must be able to meet Microsoft, customer, and/or government security screening requirements for this role.
This role requires access to Microsoft Government cloud environments, including GCC Moderate (GCCM), GCC High (GCCH), and Department of Defense (DoD) environments.
The successful candidate must be able to obtain and maintain the appropriate background investigations and customer screenings required for access to these environments.
For access to GCCH and DoD environments, this role requires the ability to obtain and maintain a favorably adjudicated Tier 3 (T3) background investigation.
For access to GCCM environments, this role requires the ability to meet Criminal Justice Information Services (CJIS) eligibility requirements.
Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Nice To Haves

Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience.
2+ years technical experience working with large-scale cloud or distributed systems.

Responsibilities

Own reliability and operational health for one or more Substrate components or services in highly regulated environments.
Serve as an actively engaged on-call engineer (OCE), participating in an on-call rotation and independently responding to incidents for owned services.
Respond to, diagnose, and resolve production incidents with minimal supervision.
Design and implement automation to reduce operational toil and improve service stability.
Develop and maintain monitoring, alerting, and telemetry to support SLOs and operational metrics.
Lead post-incident reviews for owned incidents, focusing on root cause analysis and durable fixes.
Collaborate with software engineering teams to embed reliability and operability into service design.
Write and maintain production-quality code and automation that improves reliability, scalability, and operational efficiency.