Senior Software Engineer, AI Infrastructure

The Allen Institute for Artificial Intelligence • Seattle, WA

1d•Onsite

About The Position

Persons in these roles are expected to work from our offices in Seattle. On-site requirements vary based on position and team. If you have questions about on-site work arrangements for this role, please ask your recruiter.

Our base salary range is $126,000 - $189,000, and in addition we have generous bonus plans to provide a competitive compensation package.

Ai2 remains a lighthouse for Open Science. Founded by the late Paul Allen, we are a non-profit research institute dedicated to building AI for the common good. We don't have a stock price to defend or a walled garden to protect. Instead, we have a mission: to provide the global research community with the transparent, high-performance foundations they need to achieve humanity-enriching breakthroughs.

We operate at the pace and scale of a world-class tech startup but with the intellectual soul of a research lab. We build and operate systems like Beaker to coordinate the simultaneous training of frontier models (like OLMo) across massive GPU clusters. Our job is to ensure that the next great AI breakthrough isn't stalled by a resource bottleneck or a proprietary gatekeeper.

At Ai2, we believe that the most important AI breakthroughs should be transparent and accessible. Your challenge is to build the infrastructure that makes this possible. You will bridge the gap between our researchers and our GPU clusters. You will be a senior technical contributor responsible for ensuring that when a researcher submits a job, the software schedules it intelligently and the hardware executes it flawlessly.

Requirements

8+ years of professional experience developing business-critical software and operating large-scale compute infrastructure.
Proficiency in Go and/or Python preferred.
Bachelor’s degree in a related field; a relevant advanced degree may substitute for an equivalent number of years of technical work experience.
Expert-level knowledge of Linux internals and container runtimes like Docker.
A proven track record of designing, debugging, and optimizing high-scale distributed systems and databases.
Exceptional writing skills and the ability to drive consensus across diverse groups of researchers and engineers.
A principled approach to engineering: You care about how systems are built and are excited by the unique constraints and freedoms of a non-profit research environment.

Nice To Haves

Applied experience with workload schedulers (like Kubernetes or Slurm) and high-performance networking (NCCL and InfiniBand).
Prior experience training or fine-tuning frontier AI models.
Deep systems administration expertise or "Site Reliability Engineering" (SRE) background in an HPC context.
Experience contributing to open-source infrastructure or orchestration projects.
Familiarity with on-prem storage systems like WEKA and Ceph.

Responsibilities

Independently design and deliver critical systems that span the entire stack—from the Beaker job scheduler to the execution runtime.
Build innovative tooling and software-defined infrastructure to accelerate researcher velocity and automate cluster health management.
Conduct root-cause analysis on complex distributed system failures and implement optimizations for distributed workloads.
Provide valuable input into the roadmap for managing large-scale HPC systems, including the deployment of compute, networking, and storage in partnership with leadership.
Foster a high-performance culture by reviewing code/design docs, mentoring team members, and driving process improvements within the team.
Effectively communicate and collaborate with internal research staff to share system designs, gather feedback, and support engineers on implementation tasks.