Senior Technical Program Manager

Microsoft • Redmond, WA

About The Position

Microsoft’s Cloud Operations & Innovation (CO+I) organization powers the infrastructure that enables Microsoft’s cloud services. Within CO+I, Critical Environment Systems Intelligence (CESI) builds and maintains the intelligence systems, environmental telemetry pipelines, reliability models, and automation workflows that keep Microsoft’s datacenters operating safely and efficiently at hyperscale.

Central to CESI, the Data Center Infrastructure Data Engineering & Analytics (DC IDEA) team extends this foundation by developing telemetry pipelines, analytics platforms, and data models that transform raw datacenter signals into actionable insights. IDEA increases observability, accelerates fault detection, and strengthens operational readiness across Microsoft’s global datacenter fleet. Within IDEA, the RADAR team leads Microsoft’s sensor‑health visibility and detection strategy. RADAR designs and operates sensor‑health detection logic, alerting frameworks, and triage workflows that ensure sensor reliability across leased and company‑operated datacenter environments, making Microsoft’s cloud more resilient and reliable.

As a Senior Technical Program Manager on the RADAR team, you will own cross‑organizational programs that deliver end‑to‑end sensor‑health detection, alerting, and triage. You will design and operationalize workflows, establish clear engagement models, and drive the onboarding of new detection scenarios, directly improving reliability for Microsoft’s global datacenter fleet. This role blends technical depth with program leadership to turn noisy telemetry into actionable signals, streamline incident response, and raise the bar on observability and availability.

Requirements

Bachelor's Degree AND 4+ years of experience in engineering, product/technical program management, data analysis, or product development OR equivalent experience.
2+ years of experience managing cross-functional and/or cross-team projects.

Nice To Haves

8+ years of experience in technical program management, engineering, or reliability/observability domains, preferably in the Datacenter Critical Environment space.
Demonstrated ability to lead complex, multi‑team initiatives from concept to production in large‑scale environments.
Ability to read and reason about technical documentation, schemas, APIs, and data models to support design and decision‑making.
Strong analytical and problem‑solving skills; comfortable working with metrics, dashboards, instrumentation, and system‑performance data.
Proven ability to drive clarity, structure, and alignment across engineering and operations stakeholders.
Experience with telemetry ingestion, stream processing, anomaly detection, signal quality evaluation, or alerting systems.
Familiarity with incident management, SRE practices, service‑health measurement, and operational readiness frameworks.
Experience collaborating across hardware, software, and datacenter operations teams in high‑scale technical environments.
Ability to produce concise specifications, frameworks, and operational workflows that enable complex operational teams.

Responsibilities

Lead delivery of RADAR’s mission by implementing and scaling sensor‑health detection, alerting, and triage capabilities across Microsoft datacenters, ensuring high‑quality signal visibility and reliable operational outcomes.
Design and operationalize core workflows for sensor‑health detection, alert routing, validation, and triage, partnering closely with upstream telemetry systems and downstream incident‑response teams.
Drive cross‑team orchestration by creating and strengthening relationships across engineering, hardware, operations, and service teams to integrate and execute multi-feature scenarios and platform capabilities.
Build and manage onboarding processes for new telemetry types and detection scenarios, including requirements templates, validation criteria, handoff procedures, and governance frameworks.
Champion process excellence by maturing workflows, training partners, and driving adoption of consistent operating models for new signals, anomaly detection patterns, and incident‑response processes.
Lead partner alignment and influence to shape and deliver shared roadmaps across divisional boundaries, ensuring detection, alerting, and observability capabilities evolve cohesively.
Identify gaps and opportunities through structured feedback loops; synthesize insights into clear problem statements, repeatable patterns, and actionable guidance for leadership and engineering stakeholders.
Manage schedules and execution across epics, sprints, semester plans, and releases, tracking dependencies, anticipating risks, and driving cohesive delivery across partner teams.
Produce clear technical documentation including specifications, decision records, runbooks, and operational procedures to support partner readiness and consistent implementation.
Drive continuous improvement by monitoring detection quality, validating system behavior, and guiding enhancements that strengthen reliability, observability, and operational readiness.