Site Reliability Engineer

Microsoft, Redmond, WA
20h

About The Position

The Site Reliability Engineer (SRE) for Azure xDPU Storage Team – Hardware Enablement is responsible for ensuring the reliability, availability, and performance of Fungible DPU-based Azure Storage devices as they integrate next-generation networking and compute offload hardware. This role focuses on safe bring-up, validation, and scaled production operation of DPU-enabled platforms, bridging hardware, firmware, and software reliability and maintenance.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees, we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Requirements

  • Associate's Degree in Computer Science, Information Technology, or related field
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field
  • OR equivalent experience.
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Nice To Haves

  • Bachelor's Degree in Computer Science, Electrical Engineering, Computer Engineering, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience.
  • Experience operating large-scale, distributed systems in a lab/validation environment.
  • Experience working close to hardware, including networking, storage, or accelerator technologies such as SmartNICs, DPUs, or offload engines.
  • Proficiency in one or more programming or scripting languages (C++, C#, Python, Go, or PowerShell), with experience reading lower-level system code.
  • Hands-on experience with Microsoft and Azure lab infrastructure and live-site operations.
  • Demonstrated understanding of networking, operating systems, and performance characteristics of I/O-intensive distributed systems.
  • Direct experience with Fungible DPU technology or similar SmartNIC/DPU platforms.
  • Existing hands-on experience working in Microsoft MLS (Microsoft Lab Services) or equivalent internal lab environments, including lab-based hardware validation, performance testing, and bring-up workflows.
  • Experience enabling new hardware platforms or accelerators in a Windows or mixed-OS environment.
  • Familiarity with firmware lifecycles, hardware validation, and silicon bring-up processes.
  • Experience with infrastructure-as-code and CI/CD pipelines (ARM/Bicep, Terraform, Azure DevOps).

Responsibilities

  • Own end-to-end reliability for Azure Storage hardware running in on-prem lab environments.
  • Partner with silicon, firmware, BIOS, networking, and OS teams to enable and validate DPU hardware for specific storage use cases.
  • Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for DPU-accelerated storage scenarios within our lab and pre-prod environments.
  • Lead live-site incident response and mitigation for hardware-, firmware-, or DPU-related issues, including deep root-cause analysis across hardware/software boundaries within our lab and pre-prod environments.
  • Build automation for provisioning, configuration, validation, canarying, rollback, patching, and recovery of DPU-enabled Azure Storage systems within our lab and pre-prod environments.
  • Develop reliability validation strategies, including stress, fault-injection, and chaos testing for DPU hardware enablement and management.
  • Create and maintain operational runbooks, diagnostics, telemetry, and health models specific to Fungible DPU platforms within our lab and pre-prod environments.
  • Drive improvements in observability and alerting by extending Azure Monitor and internal systems with DPU- and hardware-level signals.