storage server teams

Amazon • Cupertino, CA

22h

About The Position

We are seeking an experienced Systems Development Engineer to lead the development of automation software, diagnostic tooling, and fleet health infrastructure for our server platforms. You will work across multiple teams and organizations to build scalable, reliable systems that keep our storage and accelerated (AI/ML) compute fleet healthy, with a vision toward zero-touch operations where automation detects, diagnoses, and resolves issues without human intervention.

You will be a technical leader solving complex architectural problems that may not be well-defined in advance. You will own your team's systems, proactively identify deficiencies, and write scalable, robust code to solve issues before they impact customers. You will decompose large, difficult server testability, reliability, and diagnosis problems into straightforward tasks and components, leading delivery yourself and through others in parallel, using a combination of hardware, software, system design, processor architecture, diagnostics, and operations knowledge.

You will collaborate with a variety of roles (SDEs, SDETs, Mechanical/Electrical/Hardware Engineers, TPMs, Managers, Principals) and organizations through server conception, test validation, qualification, launch, and operations, driving high quality and reliability into current and future designs for AWS server solutions. You will also work closely with ODMs and Design Partners to ensure our tooling, diagnostics, and automation requirements are met throughout the hardware development lifecycle (NPI).

Systems Development Engineers in AWS Hardware Engineering wear many hats. From orchestration tooling development to hardware integration to kernel driver debugging, we dive deep into problems across the breadth of AWS. Our teams are directly responsible for launching and maintaining server hardware in the fleet, including storage servers powering distributed storage platforms and AI/ML accelerator servers with GPUs. Located in Seattle and Cupertino, we work with internal development teams, ODMs, and design partners to deliver servers deployed in datacenters worldwide.

Requirements

6+ years of non-internship professional software development experience
6+ years of systems design, software development, operations, automation, and process improvement experience
6+ years of experience designing or architecting new and existing systems (design patterns, reliability, and scaling)
5+ years of programming experience with at least one modern language such as C++, C#, Java, Python, Golang, PowerShell, or Ruby
Experience with Linux/Unix
Experience leading the design, build and deployment of complex and performant (reliable and scalable) software solutions in production

Nice To Haves

Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and livesite operations
Experience taking a leading role in building complex software or computing infrastructure that has been successfully delivered to customers
7+ years of professional experience
Experience building predictive failure detection or proactive remediation systems at fleet scale
Experience with Linux kernel driver development
Experience with storage, compute, and GPU/accelerator platforms (NVIDIA), including driver integration, diagnostics, or performance validation
Experience with distributed storage systems (block, object, or file)
Familiarity with server hardware architecture, BMC/IPMI, firmware, PCIe topology, NVLink, and hardware diagnostics
Experience working with ODMs or hardware design partners through the product development lifecycle
Experience building zero-touch or self-healing automation for large-scale infrastructure
Experience working in large-scale datacenter or cloud environments
Track record of rapidly coming up to speed on new engineering disciplines and making impactful decisions
Experience with hardware bring-up, validation, and fleet-wide deployment
Familiarity with telemetry pipelines, anomaly detection, and operational metrics at scale

Responsibilities

Build and own the automation infrastructure responsible for the health of the server fleet across storage and accelerator (AI/ML) compute platforms
Design and implement predictive failure detection systems using telemetry, sensor data, error trending, and log correlation to identify hardware issues before they cause customer impact
Drive toward zero-touch operations — building automation that detects, diagnoses, triages, and remediates hardware and software faults without human intervention
Develop monitoring tools, dashboards, and alerting systems to provide real-time visibility into fleet health across lab and production environments
Define and track fleet health metrics (failure rates, mean time to detect, mean time to repair, first-time fix rate, predictive accuracy)
Debug and resolve complex system-level issues across storage, compute, GPU, and networking in production environments
Troubleshoot Linux boot and runtime failures across x86 and ARM architectures, including PCIe, power, NIC, NVMe, and GPU subsystems
Perform root cause analysis on hardware failures — correlating across firmware, kernel, driver, and physical layer to isolate faults
Build diagnostic tooling that automates root cause identification and reduces reliance on manual triage
Lead the definition and development of software, automation, and enabling tools for server hardware programs; track and report progress
Design and build scalable system-level software with focus on durability, availability, security, and diagnostics
Develop and maintain device drivers for Linux on ARM and x86 architectures
Build automation solutions using modern programming languages (Python, Ruby, Java, C/C++, etc.)
Work with OS internals, storage subsystems, and accelerator/GPU software stacks in Linux-based environments
Build, manage, and deploy CI/CD pipelines for rapid deployment of code changes to org-owned and customer-owned systems
Work across internal HWEng teams to ensure new server hardware addresses data path and control path functionality needed by dependent service teams
Work closely with internal customers to identify potential problems early when onboarding new servers (storage or accelerated compute) into their ecosystem
Engage with ODMs and design partners on testability, diagnostic, and automation requirements during hardware design and development
Contribute to server design to improve robustness, testability, diagnosability, and reliability
Partner with datacenter operations teams to close the loop between field failures and design improvements

Benefits

health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage)
401(k) matching
paid time off
parental leave
