Site Reliability Engineer

Microsoft•Atlanta, GA

32d

About The Position

Leverages technical expertise and telemetry analysis alongside advanced artificial intelligence (AI) and machine learning (ML) algorithms across a range of components and/or features to identify patterns and opportunities to implement configuration and data changes for one or more platforms, systems, or products in production using code, tooling, and automation. Independently writes code or scripts that automate the performance of scalable operations processes (e.g., monitoring, alerting, deploying products and updates) across components and features of products operating at scale. Supports ongoing engagements with product engineering teams by participating in code/design reviews, regular meetings, on-call rotations, and incident responses throughout product development and operations cycles. Responds to incidents during regular on-call rotations by identifying the level of impact, troubleshooting issues, taking appropriate action to mitigate impact, and deploying appropriate fixes to resolve root cause(s). Notifies product teams and owners to major customer impacting issues and escalates resolution of highly impactful issues affecting multiple components or features to other engineers or engineering teams as needed. Communicates details and resolutions through post-mortem reports and review meetings. Develops an understanding of how to safely and reliably manage changes in production by using existing tools and automation to enable product engineering teams implement changes across a defined range of components or features, with direction from other engineers. Develops alerts and instrumentation across components and features to monitor product capacity, related security risk, and resource demands and analyze telemetry data using existing capacity planning models. Draws insights from analyses of capacity and resource data to optimize component and feature code to manage resources and capacity across limited range of use conditions and system parameters. Designs, develops, and maintains telemetry pipelines and monitoring tools that detail operations metrics (e.g., availability, reliability, performance, efficiency) of product components and features operating at scale. Independently performs analyses using existing tools and/or models to identify insights and shares them with product engineering teams to directly contribute to improvements in product development and/or operations. Monitors the impact of changes on operations metrics (e.g., Time-to-X). Demonstrates expertise in distributed systems design, interactions between cloud technology layers and components, common dependencies at scale, and the code that defines infrastructures. Can identify and recommend configurations optimal of cloud technology solutions and modify the code base that defines systems or cloud technologies to improve the security, quality, reliability, and operability of supported products with minimal guidance from other engineers.

Requirements

Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
These requirements include but are not limited to the following specialized security screenings:
Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience.
2+ years technical experience working with large-scale cloud or distributed systems.

Responsibilities

Leverages technical expertise and telemetry analysis alongside advanced artificial intelligence (AI) and machine learning (ML) algorithms across a range of components and/or features to identify patterns and opportunities to implement configuration and data changes for one or more platforms, systems, or products in production using code, tooling, and automation.
Independently writes code or scripts that automate the performance of scalable operations processes (e.g., monitoring, alerting, deploying products and updates) across components and features of products operating at scale.
Supports ongoing engagements with product engineering teams by participating in code/design reviews, regular meetings, on-call rotations, and incident responses throughout product development and operations cycles.
Responds to incidents during regular on-call rotations by identifying the level of impact, troubleshooting issues, taking appropriate action to mitigate impact, and deploying appropriate fixes to resolve root cause(s).
Notifies product teams and owners to major customer impacting issues and escalates resolution of highly impactful issues affecting multiple components or features to other engineers or engineering teams as needed.
Communicates details and resolutions through post-mortem reports and review meetings.
Develops an understanding of how to safely and reliably manage changes in production by using existing tools and automation to enable product engineering teams implement changes across a defined range of components or features, with direction from other engineers.
Develops alerts and instrumentation across components and features to monitor product capacity, related security risk, and resource demands and analyze telemetry data using existing capacity planning models.
Draws insights from analyses of capacity and resource data to optimize component and feature code to manage resources and capacity across limited range of use conditions and system parameters.
Designs, develops, and maintains telemetry pipelines and monitoring tools that detail operations metrics (e.g., availability, reliability, performance, efficiency) of product components and features operating at scale.
Independently performs analyses using existing tools and/or models to identify insights and shares them with product engineering teams to directly contribute to improvements in product development and/or operations.
Monitors the impact of changes on operations metrics (e.g., Time-to-X).
Demonstrates expertise in distributed systems design, interactions between cloud technology layers and components, common dependencies at scale, and the code that defines infrastructures.
Can identify and recommend configurations optimal of cloud technology solutions and modify the code base that defines systems or cloud technologies to improve the security, quality, reliability, and operability of supported products with minimal guidance from other engineers.