Site Reliability Engineer II - CTJ - TS/SCI

Microsoft
$100,600 - $199,000

About The Position

Join the team that keeps Microsoft 365 running in sovereign cloud environments where reliability, scalability, and security are non-negotiable. You'll work on distributed systems at massive scale, automating operations, building disaster recovery capabilities, and engineering solutions that eliminate toil and improve service delivery. Bring your expertise in large-scale systems and help us set the standard for sovereign cloud reliability.

The M365 Sovereign Clouds organization is building the future of secure productivity for the world's most critical customers. As part of Azure Silver and Microsoft Sovereign Clouds, we deliver and operate the full Microsoft 365 suite, including Office 365, Exchange, Outlook, Teams, SharePoint, OneDrive, and Purview, within highly regulated sovereign cloud environments. We are a team of innovators and problem-solvers who thrive on transforming complex challenges into reliable, high-performance services that empower sovereign cloud customers. Our culture is rooted in growth mindset, innovation, collaboration, and inclusion, and we believe that diverse perspectives drive our best work.

On the Security & Compliance team, you'll work with other engineers on the systems that protect M365 sovereign cloud customers from phishing, malware, spam, and data governance challenges. These systems process and protect millions of messages and documents daily. Our sub-teams offer exciting opportunities to work on highly complex systems that enable information protection and data governance for our customers.

The right candidate for this job:
  • Is passionate about distributed systems and working with highly scalable services.
  • Enjoys new technological challenges and is motivated to solve them.
  • Is excited about making better software and continuously improving the development, integration, and deployment processes.
  • Is a self-starter who thrives in a bottom-up, fast-paced, highly technical environment.
  • Is an effective collaborator, experienced in creating technical partnerships across teams.
  • Is committed to ensuring exceptional customer satisfaction through technical excellence.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Requirements

  • Master's Degree in Computer Science or related technical field AND 3+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
    o OR Bachelor's Degree in Computer Science or related technical field AND 5+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
    o OR equivalent experience.
  • 2+ years technical experience working with large-scale cloud or distributed systems.
  • Candidates must have an active TS/SCI clearance and be willing and eligible to upgrade to TS/SCI (with polygraph). This role will require candidates to maintain the TS/SCI (with polygraph) clearance.
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required pre-offer and post-hire for this role.
  • This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Responsibilities

  • Responds to incidents during regular on-call rotations by identifying the level of impact, troubleshooting issues, taking appropriate action to mitigate impact, and deploying appropriate fixes to resolve root cause(s). Notifies product teams and owners of major customer-impacting issues and escalates resolution of highly impactful issues affecting multiple components or features to other engineers or engineering teams as needed. Communicates details and resolutions through post-mortem reports and review meetings.
  • Independently writes code or scripts that automate the performance of scalable operations processes (e.g., monitoring, alerting, deploying products and updates) across components and features of products operating at scale.
  • Designs, develops, and maintains telemetry pipelines and monitoring tools that detail operations metrics (e.g., availability, reliability, performance, efficiency) of product components and features operating at scale. Independently performs analyses using existing tools and/or models to identify insights and shares them with product engineering teams to directly contribute to improvements in product development and/or operations. Monitors the impact of changes on operations metrics (e.g., Time-to-X).
  • Independently uses existing tools and/or models to troubleshoot problems or flaws affecting the availability, security, reliability, performance, and/or efficiency of components and features, leveraging artificial intelligence (AI) and machine learning (ML) capabilities. Proposes solutions that will resolve and prevent recurring issues and brings them to the attention of their Site Reliability Engineering (SRE) and/or product engineering teams.
  • Independently creates, tests, and deploys changes through a safe deployment process (SDP) to enhance code quality and improve the observability, security, reliability and operability of one or more platforms, systems, or products operating at scale.
  • Shares insights and best practices via documented artifacts that can be applied to improve development and operations of system, platform, or product components and features by participating in code/design reviews, incident drills and debriefs, and regular meetings, as well as interactions with more experienced SREs and members of product engineering teams.
  • Engages with product engineering teams by participating in code/design reviews, regular meetings, on-call rotations, and incident responses throughout product development and operations cycles. Utilizes technical knowledge of systems/platforms and insights drawn from product engineering teams, security best practices, artificial intelligence (AI)/machine learning (ML), and telemetry analyses to suggest potential improvements in code base and designs across components and features of one or more products.