Staff Site Reliability Engineer (Collaboration Engineering)

NBCUniversal•Orlando, FL

Hybrid

About The Position

The Staff Site Reliability Engineer (SRE) for Workplace Engineering is responsible for the reliability, performance, security, and operational excellence of enterprise workplace collaboration and endpoint services used globally by employees and partners. This role applies an engineering mindset to operations: defining service level indicators and objectives (SLIs/SLOs), reducing toil through automation, improving observability, and strengthening incident response. The goal is a consistent, high-quality collaboration experience across messaging, meetings, voice, file sharing, knowledge sharing, device management platforms, and Copilot/AI engineering.

Microsoft 365: Teams (chat, meetings, webinars, Teams Phone), SharePoint Online, OneDrive, Exchange Online, Microsoft Entra ID (Azure AD), Microsoft Purview, Defender for Office 365, Intune (Endpoint Management)
Hybrid messaging and identity integrations (as applicable): Exchange Server, directory synchronization, mail flow and routing
Collaboration endpoints and devices: Teams Rooms, certified headsets/cameras, conference room AV integrations
Ecosystem integrations: Power Platform (Power Automate/Apps), Graph API, third-party conferencing/messaging where in use (e.g., Zoom/Slack), mail hygiene/security gateways

Architect and optimize global Microsoft Intune and Jamf Pro environments.
Orchestrate Windows Update for Business (WUfB), third-party application patching, and compliance policies to maintain a hardened security posture.
Automate packaging and deployment of Windows applications, maintaining a rigorous cadence for third-party updates.
Leverage PowerShell and Graph API to automate repetitive configuration tasks and self-healing remediations.
Partner with Security Operations to remediate vulnerabilities.
Develop and enforce Configuration Profiles, Compliance Policies, and Conditional Access rules.
Own the reliability and scaling of Azure Virtual Desktop (AVD) and Windows 365 (Cloud PC), optimizing for both performance and cost-efficiency.
Define and operationalize SLIs/SLOs and error-budget policies for collaboration services (Teams chat/meetings/voice, SharePoint/OneDrive, Exchange) with clear customer-impact measurements.
Own end-to-end reliability engineering: capacity planning, performance tuning, resilience reviews, dependency mapping, and proactive risk reduction for critical collaboration journeys.
Demonstrated expertise in developing, operationalizing, and scaling AI engineering capabilities, including platform design, model lifecycle management, automation, reliability, and enterprise adoption.
Strong knowledge of AI governance frameworks, with experience establishing guardrails for responsible AI use, risk management, security, compliance, data controls, and ongoing operational oversight.
Build and evolve observability for collaboration platforms: health dashboards, telemetry standards, alert strategy (high signal/low noise), and synthetic monitoring aligned to user experience.
Lead incident response for high-severity events: establish incident roles, drive rapid triage/mitigation, coordinate cross-team communication, and produce blameless post-incident reviews with durable corrective actions.
Engineer automation to reduce operational toil: provisioning, policy/config drift detection, lifecycle management, reporting, and remediation using PowerShell and APIs; establish reusable runbooks and self-service patterns.
Strengthen change and release practices: production readiness reviews, controlled rollouts, maintenance windows, validation plans, and rollback strategies to reduce customer impact.
Partner with Security/Compliance to ensure collaboration services meet governance requirements (identity and access, DLP, retention, eDiscovery, information protection), while balancing usability and reliability.
Provide Staff-level technical leadership: set engineering standards, mentor engineers, influence roadmap priorities, and align stakeholders on reliability tradeoffs and investment.
Establish and lead reliability operating mechanisms (on-call standards, incident command readiness, postmortem quality, action-item governance, and quarterly reliability reviews) to improve consistency across teams.
Coach, mentor, and sponsor engineers across levels: provide technical guidance, review designs and postmortems, and raise the bar on documentation, runbooks, and operational readiness.
Drive cross-organization alignment on reliability priorities and investment by presenting trends, risks, and proposals to leadership; secure commitments and ensure delivery against measurable outcomes.
Serve as an escalation point for complex, cross-domain issues spanning identity, messaging, endpoints, and network dependencies; engage vendors as needed and ensure issues are driven to resolution.

Requirements

12+ years of experience in reliability engineering, systems engineering, DevOps, or large-scale collaboration/communications operations (enterprise or SaaS), including ownership of production services
Deep expertise with collaboration platforms and ecosystems: Microsoft 365 (Teams, including voice, meetings, and Rooms; SharePoint Online; OneDrive; Exchange Online) and their dependencies (identity, endpoints, networking)
Hands-on experience defining SLIs/SLOs, building observability (metrics/logs/traces), and operating an incident management program (on-call, severity model, communications, postmortems)
Strong automation skills with PowerShell and APIs (Microsoft Graph preferred); ability to build tooling that improves reliability and reduces toil
Experience with cloud identity and access (Microsoft Entra ID/Azure AD, Conditional Access, MFA, RBAC/PIM) and collaboration governance (Purview, DLP, retention, eDiscovery) preferred
Bachelor’s degree in Computer Science/Engineering (or equivalent practical experience)

Nice To Haves

Executive-level written and verbal communication skills; able to translate reliability data into clear decisions, tradeoffs, and action plans
Proven ability to influence across functions (Security, Network, End User Computing, Architecture, Product/Program) without formal authority
Strong systems thinking and customer orientation; focuses on user journeys (chat, meetings, voice, file sharing) and measurable experience outcomes
Demonstrated technical leadership through mentorship, sponsorship, and talent development; builds an inclusive, high-performing engineering culture
High bar for operational excellence; insists on clear ownership, durable fixes, strong postmortems, and measurable follow-through
Comfort operating in ambiguity and driving large, multi-quarter improvements with measurable results

Responsibilities

Responsible for the reliability, performance, security, and operational excellence of enterprise workplace collaboration & endpoint services.
Define service level indicators/objectives (SLIs/SLOs) and reduce toil through automation.
Improve observability and strengthen incident response.
Architect and optimize global Microsoft Intune and Jamf Pro environments.
Orchestrate Windows Update for Business (WUfB), third-party application patching, and compliance policies.
Automate packaging and deployment of Windows applications.
Leverage PowerShell and Graph API to automate repetitive configuration tasks and self-healing remediations.
Partner with Security Operations to remediate vulnerabilities.
Develop and enforce Configuration Profiles, Compliance Policies, and Conditional Access rules.
Own the reliability and scaling of Azure Virtual Desktop (AVD) and Windows 365 (Cloud PC).
Define and operationalize SLIs/SLOs and error-budget policies for collaboration services.
Own end-to-end reliability engineering: capacity planning, performance tuning, resilience reviews, dependency mapping, and proactive risk reduction.
Develop, operationalize, and scale AI engineering capabilities.
Establish guardrails for responsible AI use, risk management, security, compliance, data controls, and ongoing operational oversight.
Build and evolve observability for collaboration platforms.
Lead incident response for high-severity events.
Engineer automation to reduce operational toil.
Strengthen change and release practices.
Partner with Security/Compliance to ensure collaboration services meet governance requirements.
Provide Staff-level technical leadership: set engineering standards, mentor engineers, influence roadmap priorities, and align stakeholders.
Establish and lead reliability operating mechanisms.
Coach, mentor, and sponsor engineers across levels.
Drive cross-organization alignment on reliability priorities and investment.
Serve as an escalation point for complex, cross-domain issues.

Benefits

Equal employment opportunities to all applicants and employees without regard to race, color, religion, creed, gender, gender identity or expression, age, national origin or ancestry, citizenship, disability, sexual orientation, marital status, pregnancy, veteran status, membership in the uniformed services, genetic information, or any other basis protected by applicable law.
Right to request a reasonable accommodation if you are a qualified individual with a disability or a disabled veteran and require support throughout the application and/or recruitment process as a result of your disability.
