Senior Software Engineer, AI Infrastructure

The Allen Institute for Artificial Intelligence•Seattle, WA

$126,000 - $189,000•Onsite

About The Position

This role is for a Senior Software Engineer focused on AI Infrastructure. The position is based in Seattle, with on-site requirements varying by team. The institute is a non-profit research organization dedicated to building AI for the common good, emphasizing Open Science, radical transparency, and mission over margin. It operates at the pace of a tech startup with the soul of a research lab. The Beaker Ecosystem is a key part of its infrastructure, coordinating the training of frontier models across large GPU clusters. The challenge for this role is to build the infrastructure that makes AI breakthroughs transparent and accessible, bridging the gap between researchers and GPU clusters. The engineer will be responsible for ensuring intelligent job scheduling and flawless hardware execution.

Requirements

8+ years of professional experience developing business-critical software and operating large-scale compute infrastructure.
Proficiency in Go and/or Python preferred.
Bachelor’s degree in a related field; a relevant advanced degree may substitute for an equivalent number of years of technical work experience.
Expert-level knowledge of Linux internals and container runtimes like Docker.
A proven track record of designing, debugging, and optimizing high-scale distributed systems and databases.
Exceptional writing skills and the ability to drive consensus across diverse groups of researchers and engineers.
A principled approach to engineering: You care about how systems are built and are excited by the unique constraints and freedoms of a non-profit research environment.

Nice To Haves

Applied experience with workload schedulers (like Kubernetes or Slurm) and high-performance networking (NCCL and InfiniBand).
Prior experience training or fine-tuning frontier AI models.
Deep systems administration expertise or a Site Reliability Engineering (SRE) background in an HPC context.
Experience contributing to open-source infrastructure or orchestration projects.
Familiarity with on-prem storage systems like WEKA and Ceph.

Responsibilities

Independently design and deliver critical systems that span the entire stack—from the Beaker job scheduler to the execution runtime.
Build innovative tooling and software-defined infrastructure to accelerate researcher velocity and automate cluster health management.
Conduct root-cause analysis on complex distributed system failures and implement optimizations for distributed workloads.
Provide valuable input into the roadmap for managing large-scale HPC systems, including the deployment of compute, networking, and storage in partnership with leadership.
Foster a high-performance culture by reviewing code/design docs, mentoring team members, and driving process improvements within the team.
Effectively communicate and collaborate with internal research staff to share system designs, gather feedback, and support engineers on implementation tasks.

Benefits

Medical, dental, and vision coverage, plus an employee assistance program, for team members and their families.
Enrollment in a health savings account plan.
Enrollment in a healthcare reimbursement arrangement plan.
Enrollment in health care and dependent care flexible spending account plans.
Enrollment in the company’s 401(k) plan.
$125 per month to assist with commuting or internet expenses.
$200 per month for fitness and wellbeing expenses.
Up to ten sick days per year.
Up to seven personal days per year.
Up to 20 vacation days per year.
Twelve paid holidays throughout the calendar year.
Annual bonuses.
Participation in the long-term incentive plan.