TPM GPU Operations Standards and Quality

Microsoft Corporation, Redmond, WA
27d

About The Position

Microsoft's Cloud Operations & Innovation (CO+I) powers cloud services by ensuring datacenter availability and operational continuity. The Global IT Service Transition team standardizes processes so new sites and IT support teams can achieve Day 1 Operational Readiness efficiently. The Technical Program Manager (TPM) for GPU Operations Standards and Quality leads the development and enforcement of operational standards, quality assurance, and readiness for GPU deployments. This role partners across engineering, supply chain, and operations to ensure GPU deployments meet regulatory, security, and performance standards, enabling scalable and reliable operations.

Requirements

  • Bachelor's Degree AND 2+ years of experience in engineering, product/technical program management, data analysis, or product development OR equivalent experience.
  • 1+ year(s) of experience managing cross-functional and/or cross-team projects.
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role.
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Nice To Haves

  • Bachelor's Degree AND 5+ years of experience in engineering, product/technical program management, data analysis, or product development OR equivalent experience.
  • 4+ years of experience managing cross-functional and/or cross-team projects.
  • 1+ year(s) of experience reading and/or writing code (e.g., sample documentation, product demos).

Responsibilities

  • Align with Microsoft's culture, objectives and Datacenter Operational policies and standards.
  • Deliver a best-in-class new service transition and onboarding program to achieve site and operational readiness.
  • Define and implement GPU operational standards across deployment, servicing, and lifecycle management.
  • Drive cross-functional programs to define, implement, and validate GPU compliance standards across global datacenter environments.
  • Partner with engineering, supply chain, and operations teams to ensure GPU hardware and software configurations meet internal and external compliance requirements.
  • Lead risk assessments and mitigation strategies related to GPU deployments, including site operational readiness and scaled growth.
  • Develop and maintain documentation for GPU standards, audit procedures, and compliance tracking.
  • Represent CO+I in industry forums and regulatory engagements related to GPU infrastructure.
  • Establish KPIs and reporting mechanisms to monitor compliance health and drive continuous improvement.
  • Lead quality assurance initiatives to ensure compliance with performance, reliability, and safety benchmarks.
  • Develop and maintain readiness scorecards and validation frameworks for GPU infrastructure.
  • Coordinate cross-functional efforts across hardware, serviceability, and tooling teams.
  • Manage escalations, fault code governance, and exception handling for GPU-related incidents.
  • Drive continuous improvement through data-driven insights and stakeholder feedback.
  • Evolve operational excellence, with a key focus on risk management, uptime availability, and safety.
  • Build strong working relationships and engagement with our Engineering, Procurement & Construction (EPC) teams, support and tooling partners.
  • Establish operational representation through design, build, commissioning and turnover project phases, as required.
  • Create an environment that promotes learning and innovation.

What This Job Offers

Job Type

Full-time

Career Level

Mid Level

Industry

Publishing Industries

Number of Employees

5,001-10,000 employees