About The Position

We’re hiring a Principal TPM Data & Telemetry – Windows Reliability, an individual contributor role to strengthen our Reliability Telemetry & Insights function. This role ensures we can consistently operate and evolve the systems that measure Windows reliability and translate signals into clear, actionable decisions for engineering and partner teams.

This role is equal parts telemetry operations, data quality/governance, and insight-to-action program leadership. You will own critical reliability datasets and dashboards end-to-end (from ingestion and validation through reporting and operational rhythms), partner across Windows engineering and ecosystem stakeholders, and help the team scale by building repeatable processes, documentation, and broader bench strength.

Windows reliability is only as strong as the telemetry and operational system behind it. This role ensures our teams can detect regressions early, confidently explain what’s happening, and drive the right corrective actions without being dependent on a single person’s knowledge.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Requirements

  • Bachelor's Degree AND 6+ years’ experience in engineering, product/technical program management, data analysis, or product development OR equivalent experience.
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role.
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Nice To Haves

  • 3+ years of experience managing cross-functional and/or cross-team projects.
  • 7+ years of experience in one or more of: program management, data/analytics engineering, reliability engineering, telemetry operations, or product analytics.
  • Demonstrated experience owning end-to-end telemetry/analytics systems (ingestion → validation → modeling → dashboards → operational consumption).
  • Solid skills in data querying and analysis (e.g., Kusto/ADX, SQL, equivalent large-scale log analytics).
  • Experience building decision-grade reporting (e.g., Power BI or equivalent) and communicating insights to senior stakeholders.
  • Proven ability to drive cross-functional execution: aligning stakeholders, assigning ownership, and delivering outcomes through ambiguity.
  • Operational excellence mindset: quality bars, monitoring, incident management, documentation, and continuous improvement.
  • Familiarity with Windows reliability concepts (crash telemetry, drivers, servicing, regressions, device cohorts).
  • Experience with large-scale cloud data platforms (Azure data ecosystem, distributed pipelines, identity resolution).
  • Ability to automate analysis/reporting (Python, C#, Spark, data pipelines, workflow orchestration).
  • Prior experience working with hardware + software ecosystem partners (OEMs, IHVs, silicon vendors) or device quality programs.
  • Experience defining metrics/governance: semantic layers, taxonomy, standard definitions, and “single source of truth” design.
  • Comfort operating in a fast-paced environment with multiple stakeholders and shifting priorities.
  • Solid written communication skills (executive-ready narratives; clear action framing).
  • Ability to be both hands-on (querying/debugging) and high-leverage (driving alignment and ownership).

Responsibilities

  • Own/operate core reliability data pipelines and reporting workflows (availability, correctness, latency, completeness).
  • Establish operational rigor: runbooks, on-call/backup coverage, incident response, and clear escalation paths.
  • Drive data quality improvements: schema management, identity resolution, deduplication, and metric definitions.
  • Build and maintain dashboards and recurring scorecards that track key reliability outcomes (e.g., crash trends, top drivers/components, device cohorts, regressions, risk flags).
  • Proactively identify “what changed” and “why it matters” signals; translate to recommended actions and owners.
  • Create and maintain clear metric definitions, methodology notes, and interpretation guidance to avoid confusion/misalignment.
  • Collaborate with Windows engineering (e.g., kernel/driver/servicing stakeholders), quality teams, and partner-facing teams to align on measurement and priorities.
  • Support OEM/silicon/partner conversations with accurate, explainable reliability telemetry and narratives.
  • Drive cross-team alignment on what actions are required when telemetry indicates regressions or out-of-policy behavior.
  • Identify gaps in telemetry coverage and propose/drive work to close them (instrumentation improvements, new cuts, improved categorization).
  • Improve automation and scale: reduce manual reporting, simplify repetitive analysis, and harden tools so others can self-serve.
  • Establish durable operational rhythms: weekly/monthly reviews, action tracking, and follow-through mechanisms.
  • Document critical workflows and institutional knowledge (how-to guides, data lineage, known pitfalls, “how to debug” playbooks).
  • Create training and enablement materials so others can reliably back up the function.
  • Design work so it is system-owned rather than person-owned (clear ownership maps, redundancy, measurable SLAs).

Benefits

  • Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay

What This Job Offers

  • Job Type: Full-time
  • Career Level: Principal
  • Number of Employees: 5,001-10,000 employees
