Senior Site Reliability Engineer

Microsoft
Redmond, WA
73d
$119,800 - $234,700

About The Position

Microsoft is a company where passionate innovators come to collaborate, envision what can be, and take their careers further. This is a world of more possibilities, more innovation, more openness, and sky's-the-limit thinking in a cloud-enabled world.

Microsoft's Azure Data engineering team is leading the transformation of analytics in the world of data with products like databases, data integration, big data analytics, messaging & real-time analytics, and business intelligence. Our mission is to build the data platform for the age of AI, powering a new class of data-first applications and driving a data culture.

We are looking for a self-driven Senior Site Reliability Engineer (SRE) who likes taking a data-driven, systems-based approach to solving Service Reliability problems. You will be responsible for building and optimizing solutions that analyze massive amounts of telemetry and other Service Health indicators in near real time, perform automated root cause analysis, and apply the mitigations necessary to restore SLOs.

Requirements

  • 6+ years of technical experience in software engineering, network engineering, or systems administration; OR a Bachelor's Degree in Computer Science, Information Technology, or a related field AND 3+ years of technical experience in software engineering, network engineering, or systems administration; OR a Master's Degree in Computer Science, Information Technology, or a related field AND 2+ years of technical experience in software engineering, network engineering, or systems administration.
  • 4+ years of experience running large-scale cloud services.
  • Ability to meet Microsoft, customer and/or government security screening requirements.

Nice To Haves

  • 2+ years of operational experience in improving Service Reliability, Availability and Performance.
  • Understanding of Observability and MELT (metrics, events, logs, and traces) implementation patterns for large-scale services.
  • Experience with Logic Apps and with authoring Jupyter Notebooks.
  • Experience in analyzing, troubleshooting, and automating root cause analysis and mitigation of incidents impacting large-scale distributed systems.
  • Systematic problem-solving approach, coupled with effective communication skills and a sense of curiosity.
  • Ability to deal with the ambiguity associated with working in a fast-paced environment.

Responsibilities

  • Collaborate closely with engineering teams to build and enhance tooling and automation that resolve issues impacting SLOs faster and, where possible, avert incidents altogether.
  • Collaborate with customers to understand their pain points around Supportability and SLO attainment, and formulate strategies for addressing recurring issues in a sustainable way.
  • Communicate on a deeply technical level and serve as the single point of contact for enterprise customers, handling service escalations and driving issues to resolution.
  • Design and implement changes to service telemetry when the data that the automation needs to consume is not already available.
  • Enhance the customer-facing experience through proactive alerting based on utilization, trends, resource health, etc.
  • Analyze data and provide operational insights into customer experience to design and product teams, so that features are designed with supportability in mind.
  • Embody our culture and values.

Benefits

  • Industry leading healthcare
  • Educational resources
  • Discounts on products and services
  • Savings and investments
  • Maternity and paternity leave
  • Generous time away
  • Giving programs
  • Opportunities to network and connect

What This Job Offers

  • Job Type: Full-time
  • Career Level: Senior
  • Industry: Professional, Scientific, and Technical Services
  • Education Level: Bachelor's degree
