Site Reliability Engineer - Customer Response Team

Microsoft • Redmond, WA

Requirements

Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python.
1+ years technical engineering experience with large-scale distributed systems such as Azure, AWS, or Google Cloud.
These requirements include but are not limited to the following specialized security screenings
Master's Degree in Computer Science or related technical field AND 3+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 5+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
2+ years technical experience working with large-scale cloud or distributed systems.

Responsibilities

Develops technical expertise in the code, features, and operations of specific products as required to identify opportunities to improve product supportability, availability, reliability, efficiency, observability, and/or performance.
Actively participates in on-boarding, code/design reviews, and regular meetings with engineering teams that develop and/or manage those products.
Develops, tests, and implements changes to optimize code and improve the observability, reliability, and operability of components and features of one or more platforms, systems, or products operating at scale.
Leverages technical expertise in large-scale distributed systems and specific products, as well as objective insights drawn from analyses of production telemetry data, to suggest changes or add-ons to product features or code to improve the availability, reliability, efficiency, observability, and performance of product components or features supported by their team.
Engages with product engineering teams by participating in code/design reviews, regular meetings, on-call rotations, and incident responses throughout product development and operations cycles.
Leverages technical expertise on underlying systems/platforms and insights drawn from engagements with product engineering teams and telemetry analyses to propose potential improvements in code base and designs across components and features of one or more products.
Responds to incidents during regular on-call rotations by identifying the level of impact, troubleshooting issues, and deploying appropriate fixes to resolve root cause(s).
Alerts product teams and owners to major customer-impacting issues and escalates resolution of highly impactful issues affecting multiple components or features to other engineers or engineering teams as needed.
Shares details related to incidents and their resolution through post-mortem reports and during regular review meetings.
Independently uses existing tools and/or models to troubleshoot problems or flaws affecting the availability, reliability, performance, and/or efficiency of components and features.
Proposes solutions that will resolve and prevent recurring issues and brings them to the attention of their Site Reliability Engineering (SRE) and/or product engineering teams.
Leverages technical expertise and telemetry analysis across a range of components and/or features to identify patterns and opportunities to implement configuration and data changes for one or more platforms, systems, or products in production using code, tooling, and automation.
Embodies our culture and values.