Rack Scale Serviceability & Telemetry Architect

Advanced Micro Devices, Inc.
Austin, TX
Hybrid

About The Position

AMD's Data Center GPU Systems Architecture team defines next-generation AMD Instinct platforms and complete rack-scale solutions for hyperscale AI and HPC deployments. We work across silicon, GPU system firmware, server and board architecture, BMC/platform firmware, management software, security, validation, manufacturing, and ecosystem partners to turn product strategy into deployable, serviceable, production-ready platforms.

AMD is seeking a Principal Member of Technical Staff (PMTS) to own the architecture for rack-scale serviceability and telemetry across AMD Instinct product lines and complete rack-scale solutions. This is a highly visible technical leadership role responsible for defining the end-to-end manageability, observability, and serviceability architecture spanning node, chassis/tray, rack, and fleet domains. You will drive the strategy, architecture, execution, and delivery of standards-based solutions for inventory, discovery, health monitoring, telemetry, eventing, diagnostics, firmware lifecycle management, and field service workflows across the full AMD rack-scale stack.

In this role, you will independently own a critical cross-product architecture area and drive alignment across GPU/SoC architecture, server/platform architecture, BIOS/UEFI, BMC and embedded software, security, RAS, validation, ODM/OEM partners, and customer-facing teams. The role spans early concept definition through bring-up, validation, deployment, and post-launch improvement.

Requirements

  • Expert-level experience in platform architecture, system management, BMC/embedded firmware, server manageability, or adjacent domains, including significant time in architect or technical leadership roles.
  • Proven experience defining serviceability/manageability architecture for servers, accelerators, storage, networking, or rack-scale infrastructure in datacenter, cloud, AI, or HPC environments.
  • Deep knowledge of DMTF Redfish, including schema design, OEM extension strategy, eventing, update service, and telemetry concepts such as MetricReportDefinition/MetricReport; strong understanding of PLDM/MCTP for platform inventory, monitoring, control, and update workflows.
  • Strong hands-on experience with OpenBMC, including Yocto/OpenEmbedded, D-Bus, systemd, bmcweb/Redfish, phosphor services, firmware update flows, sensor frameworks, and log/event handling.
  • Experience with embedded Linux, ARM-based BMC SoCs, U-Boot, Linux kernel/device driver concepts, device tree, and low-level interfaces such as I2C/I3C, SPI, UART, GPIO, SMBus/PMBus, and related platform-management buses.
  • Strong understanding of server/platform RAS and serviceability features such as health monitoring, error logging, crashdump, diagnostics, inventory/FRU management, and remote recovery.
  • Experience with secure manageability architectures, including secure boot, root of trust, attestation, firmware signing, SPDM, and protection of out-of-band management paths.
  • Experience creating architecture specifications, product requirements, conformance plans, validation strategies, and design reviews that drive execution across multiple internal teams and external partners.
  • Strong programming and scripting background in C/C++, Python, and shell, with the ability to debug across firmware, hardware, and system software boundaries.
  • Strong written and verbal communication skills with proven ability to influence senior engineering leadership, customers, and strategic partners.
  • Bachelor’s or Master’s degree in Computer Science, Computer Engineering, Electrical Engineering, or a related technical field. Advanced degree preferred.

Nice To Haves

  • Experience with large-scale telemetry or observability pipelines, metrics consumers, or fleet operations tooling is strongly preferred.
  • Experience with AMD server or GPU platforms, AI/HPC system design, liquid cooling/power/thermal infrastructure, or OCP-aligned rack architectures is a plus.

Responsibilities

  • Define and own the end-to-end rack-scale serviceability and telemetry architecture for AMD Instinct-based solutions, spanning node BMC, chassis/rack management, service processors/controllers, management network, and fleet-level observability integration.
  • Define the standards strategy and interface architecture using DMTF Redfish, PLDM, MCTP, and related specifications, maximizing standards compliance while establishing AMD/OEM extensions only where required.
  • Drive OpenBMC-based architecture and implementation direction for BMC and rack management controllers, including D-Bus object models, bmcweb/Redfish requirements, sensor and FRU inventory models, logging, eventing, firmware update, and debug workflows.
  • Architect telemetry frameworks for health, power, thermal, inventory, error, utilization, and service data. Define schemas, metric taxonomies, triggers, event models, aggregation, retention, and reporting strategies required for at-scale observability and automated service operations.
  • Define platform serviceability flows covering discovery, inventory correlation, fault isolation, diagnostics, crashdump and error capture, remote recovery, FRU replacement, firmware/driver update orchestration, and return-to-service procedures.
  • Partner with GPU/SoC architects, board and system architects, firmware and software teams, security/RAS, validation, manufacturing, and customer engineering to translate requirements into production-ready architecture and deliverables.
  • Work closely with ODM/OEMs and ecosystem partners to review designs, close gaps, guide implementation trade-offs, and deliver robust reference solutions and customer platforms on schedule.
  • Drive validation and conformance strategy for manageability and telemetry, including interoperability, Redfish/PLDM compliance, fault injection, service workflow validation, scale testing, and field debug methodology.
  • Influence future AMD Instinct platform roadmaps using insights from bring-up, partner integrations, deployment learnings, and telemetry-driven data.
  • Represent AMD in relevant standards and open-source communities, including DMTF and OpenBMC forums, and guide upstream/downstream strategy where appropriate.
  • Mentor engineers and architects across the organization and serve as the senior technical point of contact for rack-scale serviceability and telemetry.

Benefits

  • AMD benefits at a glance.