One C++ codebase. Three radically different execution environments. We're looking for an engineering manager who thinks in terms of platforms, abstractions, and portable software architecture, and who can lead a team that ships all three.

Our SoC HAL (Hardware Abstraction Layer) team builds the platform software layer for AWS's custom Trainium and Inferentia ML accelerator chips. The HAL is a shared library that boots, configures, and manages every hardware block on the SoC (270+ instances per chip), and the same source tree compiles and runs on SystemVerilog DPI for chip verification, on QEMU for system emulation, and on Carbon OS in microcontrollers within the AWS production fleet. Your platform abstractions are what make this possible, and your APIs are the interface that hundreds of engineers across verification, emulation, and production use to interact with the chip.

Tech stack: C++17, CMake, GoogleTest, Python, SystemVerilog DPI, SPI, APB/AXI bus protocols, PCIe, UCIe, HBM, PLL, custom IPs.

Most platform software teams target one OS or one hardware family. We target three execution environments from a single source tree, and our software must be stateless, survive live updates on running production servers without reboots, and be correct down to individual register bits. A single abstraction leak can break chip verification, stall emulation, or misconfigure millions of servers in AWS's global fleet. (A minimal sketch of this multi-backend design appears at the end of this description.)

The HAL runs on an external microcontroller running embedded Linux, reaching into the chip over SPI and PCIe. It's stateless by design: the microcontroller can reboot at any time, including during customer workloads, and the HAL must resume managing the SoC by querying hardware state on demand (also sketched below). Your platform layer is what makes this resilience possible while keeping the complexity invisible to consumers.

The same codebase that runs in pre-silicon simulation months before tape-out is the codebase that runs in the production fleet. When the chip comes back from the fab, your team validates that pre-silicon models match real hardware behavior. For Trainium3, our HAL enabled a full ML training workload within 12 hours of first power-on: https://www.aboutamazon.com/news/aws/trainium-3-ultraserver-faster-ai-training-lower-cost

No ML background needed. Your platform software is the foundation that enables ML training across clusters of thousands of interconnected accelerators; you'll work on components like PCIe and HBM, but you won't need to understand ML itself.

This role can be based in Cupertino, CA or Austin, TX. The team is split between the two sites.
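For the curious, here is a minimal sketch of the multi-backend seam described above. It is an illustration under assumed names (RegisterBus, DpiBus, SpiBus, and the DPI export names are invented for this posting, not our actual API): HAL logic is written once against a narrow register-access interface, and each execution environment supplies its own implementation.

```cpp
// Minimal sketch of a backend seam that lets one HAL source tree target
// several execution environments. All names here are illustrative
// assumptions, not the team's actual API.
#include <cstdint>

// A narrow register-access interface. Everything above this seam, including
// boot, configuration, and management logic, is identical across
// simulation, emulation, and the production fleet.
class RegisterBus {
 public:
  virtual ~RegisterBus() = default;
  virtual uint32_t Read(uint64_t addr) = 0;
  virtual void Write(uint64_t addr, uint32_t value) = 0;
};

// Pre-silicon verification backend: accesses are forwarded into the
// SystemVerilog testbench through DPI-C. These two externs are
// hypothetical; in a real build the simulator resolves them at link time.
extern "C" uint32_t dpi_reg_read(uint64_t addr);
extern "C" void dpi_reg_write(uint64_t addr, uint32_t value);

class DpiBus final : public RegisterBus {
 public:
  uint32_t Read(uint64_t addr) override { return dpi_reg_read(addr); }
  void Write(uint64_t addr, uint32_t value) override {
    dpi_reg_write(addr, value);
  }
};

// Production backend: the external microcontroller reaches into the chip
// over a transport such as SPI (transaction framing elided).
class SpiBus final : public RegisterBus {
 public:
  uint32_t Read(uint64_t addr) override {
    (void)addr;  // SPI read transaction elided in this sketch
    return 0;
  }
  void Write(uint64_t addr, uint32_t value) override {
    (void)addr;
    (void)value;  // SPI write transaction elided in this sketch
  }
};

// HAL logic is written once against the interface and never knows which
// environment it is running in.
void EnablePll(RegisterBus& bus) {
  constexpr uint64_t kPllCtrl = 0x0010'0000;       // illustrative address
  bus.Write(kPllCtrl, bus.Read(kPllCtrl) | 0x1u);  // set the enable bit
}
```

In a real tree, backend selection would typically happen at build time (for example, via CMake target configuration), so each environment compiles exactly one implementation.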
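And a companion sketch of the stateless design: nothing authoritative is cached in microcontroller RAM, so recovering from a reboot is simply querying the hardware again. The register address and bit layout below are invented for illustration, under the same assumed RegisterBus seam as above.

```cpp
// Minimal sketch of the stateless pattern: the HAL keeps no authoritative
// copy of chip state in microcontroller RAM, so an unexpected reboot loses
// nothing. Register addresses and bit layouts are invented for illustration.
#include <cstdint>

class RegisterBus {  // same illustrative seam as the previous sketch
 public:
  virtual ~RegisterBus() = default;
  virtual uint32_t Read(uint64_t addr) = 0;
  virtual void Write(uint64_t addr, uint32_t value) = 0;
};

struct LinkStatus {
  bool trained;   // has the link finished training?
  uint8_t width;  // negotiated lane width
};

// Rather than remembering "this link is already trained", the HAL derives
// the answer from hardware every time it is asked.
LinkStatus QueryLinkStatus(RegisterBus& bus) {
  constexpr uint64_t kLinkStatus = 0x0020'0040;  // illustrative address
  const uint32_t raw = bus.Read(kLinkStatus);
  return LinkStatus{
      (raw & 0x1u) != 0,                        // bit 0: link trained
      static_cast<uint8_t>((raw >> 4) & 0xFu),  // bits 7:4: lane width
  };
}

// Resuming after a microcontroller reboot, even mid-workload, is just
// re-querying and acting on what the chip reports. There is no persisted
// state to restore or reconcile.
void ResumeManagement(RegisterBus& bus) {
  if (!QueryLinkStatus(bus).trained) {
    // Kick off the (elided) retraining sequence only if hardware says so.
  }
}
```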
Job Type: Full-time
Career Level: Manager
Education Level: None listed