Principal Site Reliability Engineer

Microsoft • Redmond, WA

34d

About The Position

The Core Services Infrastructure and Security team in Microsoft Teams provides the foundational infrastructure, network, security, monitoring, and governance needed to run the planet-scale distributed systems and microservices architecture that power Microsoft Teams. Security and reliability are at the heart of what this team aspires to do day in and day out. As a Principal Site Reliability Engineer on the Core Services Infrastructure and Security team, you will be responsible for infrastructure and network security, as well as the front door, routing, gateway, CDN, DNS, and monitoring layers for the microservices powering Microsoft Teams. This opportunity will allow you to improve the reliability of these mission-critical layers by investing in active-active architectures at every possible level. You will also sharpen your security acumen by working with experts in the space and become adept at troubleshooting and securing the network layer.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Requirements

Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience.
2+ years of experience in areas such as TCP/IP concepts, load balancing, CDN, ACL, routing, TLS, certificate lifecycle management, and IP network analysis, including diagnosing performance and application issues using standard tools.

Nice To Haves

Doctorate Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration OR Master's Degree in Computer Science, Information Technology, or related field AND 8+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 12+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience.
7+ years technical experience working with large-scale cloud or distributed systems.
Experience with network security, network troubleshooting, cloud security, security policy management, and certificate lifecycle management.
Knowledge of cloud infrastructure services, or Infrastructure as a Service (IaaS), which delivers computer infrastructure, typically a platform virtualization environment, as a service.
Knowledge of automation technologies, of leveraging AI for productivity improvements, and of the methods and processes used for quality and cost improvements.
Experience managing horizontal initiatives/programs that span multiple teams/services.

Responsibilities

You will drive strategic improvements in network security, monitoring, and troubleshooting across the service and with other stakeholders, while prioritizing development and implementation efforts.
You will obsess over leveraging metrics and monitors to improve the reliability of mission-critical dialtone services and scenarios.
You will apply subject-matter expertise in cross-product features, working with appropriate stakeholders (e.g., project managers) to drive multiple groups' project plans, release plans, and work items.
You will hold accountability as a Designated Responsible Individual (DRI), mentoring engineers across products and solutions and working on-call to monitor the system, product, or service for degradation, downtime, or interruptions.
You will proactively seek new knowledge and adapt to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of products, while also driving consistency in monitoring and operations at scale and sharing knowledge with other engineers.
You will embody our culture and values.
