Sr. Systems Development Engineer (AWS Generative AI & ML Servers), AWS HW Engineering

Amazon•Seattle, WA

1d•$151,200 - $204,600•Onsite

About The Position

Do you want to build the backbone of the Generative AI cloud at AWS? Do you want to build the future of the cloud for AI training and inference? Do you want to do industry-leading work delivering continuous price-performance improvements in the cloud for training multi-billion-parameter LLMs? Come join us in designing, delivering, and operating AWS cloud offerings that enable high performance and scalability for AI/ML and HPC workloads. Are you intrigued by the continuous release of new AWS services and instance types that solve bigger and more interesting business problems every day? Does that make you wish your talents were applied to those problems at cloud scale? If so, come join us: we are looking for builders like you.

The AWS Hardware Engineering team creates server designs for Amazon’s innovative web services. Our designs are industry-leading in frugality and operational excellence, and are critical to the success of the AWS business and the millions of customers who use AWS today. Our engineers solve challenging technology problems and build architecturally sound, high-quality components that enable AWS to realize critical business strategies. What makes this role unique is your direct impact on AI-powered innovation: you’ll build intelligent systems that drive the debug and development of next-generation cloud technologies.

The ideal candidate for this role is an innovative self-starter. You are knowledgeable about the full technical stack, from bare-metal server hardware up to userland software and everything in between. You have tremendous interest in cloud scale and are curious about how systems and software decisions impact the user. You insist on the highest standards and are able to develop tactical solutions and tools to diagnose and fix issues. You are an excellent systems debugger who finds interaction issues between components on server systems. You are a leader with strong organizational, planning, and communication skills.
You are a builder!

Requirements

4+ years of programming experience with at least one modern language such as C++, C#, Java, Python, Golang, PowerShell, or Ruby
3+ years of non-internship professional software development experience
3+ years of experience designing or architecting new and existing systems (design patterns, reliability, and scaling)
Experience with x86 and ARM architectures and with GPU/FPGA devices
Experience with modern storage, network, and memory devices, as well as a variety of interface standards and protocols (I2C, IPMI, SPI, PCIe)
BS degree in Computer Science, Computer Engineering, or other technical degree, or relevant work experience

Nice To Haves

7+ years in software development, systems development, SRE (Site Reliability Engineering), or Resilience Engineering
7+ years of SysDE (Systems Development Engineer) or equivalent experience
7+ years of server systems debug experience, including debugging and root-causing issues on complex server platforms
7+ years of experience contributing towards increasing durability, security, availability and scalability of systems through exploration, diagnosis and remediation
Experience with Linux kernel and user-space drivers for PCIe and external devices
Systems thinking: the ability to diagnose interactions between discrete components of a server system and drive product improvements
Strong focus on reliability, scale, and diagnostics (including developing tactical and strategic tools using Python, Go, C/C++, or any other suitable high-level language)
Solid understanding of OS internals, including the network and storage subsystems
Master’s degree in Electrical Engineering, Computer Engineering, or related field
Experience with modern storage, network, and memory devices, as well as a variety of interface standards and protocols (PCIe, SATA, SAS, NVMe)
Experience validating hardware, software, firmware, and drivers, and implementing test plans
Experience with server validation, testing, root-cause analysis, and coverage analysis
Excellent knowledge of diagnostic tool development with Python, Go, or C/C++ in a fast-paced environment
Extensive Linux knowledge

Responsibilities

You will be a technical leader solving complex architectural problems that may not have been defined beforehand.
You will own the team's systems and work proactively to identify deficiencies, write tactical code to solve issues before they impact customers, and work with your team to scale the solution.
You will decompose large, difficult server system testability, reliability, and diagnosis problems into straightforward tasks, components, or features that you will deliver yourself and lead others to deliver in parallel.
You will use a combination of hardware, software, system design, x86 architecture, process, diagnosis, and operations knowledge.
In this role you will create automation through agentic workflows.
You’ll develop smart automation solutions, implement AI-driven tools and workflows, and be part of the AI transformation.
You will work with a variety of job roles (SDEs, SDETs, Hardware Engineers, TPMs, Managers, Principals) and groups (AWS Hardware Engineering, EC2, other AWS services) through server conception, test, launch, and operations.
You will drive high quality and reliability into new and future designs for AWS accelerated server solutions for the AWS Cloud.

Benefits

health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage)
401(k) matching
paid time off
parental leave
sign-on payments
restricted stock units (RSUs)
