Site Reliability Engineer

Microsoft•Redmond, WA

74d•$119,800 - $234,700

About The Position

The Firmware Deployment team within Microsoft's Silicon Cloud Hardware Infrastructure Engineering (SCHIE) organization is responsible for building and operating world-class software and data-driven services that support Azure's hardware infrastructure development. Our mission is to enable safe, reliable, and intelligent deployment of firmware payloads across the Azure fleet, ensuring system health and operational quality at scale.

We are seeking a Site Reliability Engineer for the Firmware Deployment team. In this role, you will be instrumental in shaping the future of the Azure fleet. Your primary responsibility will be developing and applying stable firmware releases across the GPU fleet, and you may also support other related environments. This work is essential to maintaining Microsoft's security and performance standards while delivering an outstanding experience for our customers. By deploying and managing firmware updates with a focus on stability and operational excellence, you will help ensure the reliability and efficiency of Azure's hardware infrastructure, safeguard system health, and contribute to the ongoing success and growth of Azure's global infrastructure.

Requirements

Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
3+ years of experience in software engineering or operations for large-scale distributed systems
Ability to support a 24x7 data center environment, including participation in an on-call rotation and availability during non-standard business hours (evenings, nights, weekends, or holidays) as operational needs require
Proficiency in one or more programming languages (C#, Python, Go, or similar)
Understanding of cloud infrastructure (Azure preferred), networking, and system design
Familiarity with monitoring tools, incident management frameworks, and DevOps practices
Problem-solving and debugging skills
Ability to meet Microsoft, customer and/or government security screening requirements

Responsibilities

Build and bring specialized knowledge across multiple production aspects (monitoring, release engineering, testing, live site excellence, buildout, performance optimization, capacity management)
Analyze large-scale telemetry and operational data to uncover insights and drive data-informed decisions
Apply proven principles and practices such as safe deployment, reliability testing, elimination of single points of failure, disaster recovery, SLO-based monitoring, throttling, infrastructure management automation, post-mortem excellence, and adoption of common systems
Respond to alerts and incidents
Build and follow playbooks to drive root cause analysis and reviews
Partner with hardware and firmware teams to understand system behavior and identify opportunities for predictive analytics
Participate in an on-call rotation, be available during non-standard business hours, and contribute to service reliability and incident resolution

Benefits

Industry leading healthcare
Educational resources
Discounts on products and services
Savings and investments
Maternity and paternity leave
Generous time away
Giving programs
Opportunities to network and connect

What This Job Offers

Job Type: Full-time
Career Level: Mid Level
Industry: Professional, Scientific, and Technical Services
Education Level: Master's degree
