Senior Site Reliability Engineer

Helion•Everett, WA

Onsite

About The Position

We are a fusion power company based in Everett, WA, with the mission to build the world's first fusion power plant, enabling a future with unlimited clean electricity. Our vision is a world with clean, reliable, and affordable energy for everyone. Since Helion's founding in 2013, we have raised over $1 billion from long-time investors such as Sam Altman, Mithril, and Capricorn Investment Group as well as new investors SoftBank and Lightspeed to propel us forward. Our last prototype, Trenta, completed 10,000 high-power pulses and reached plasma temperatures of 100 million degrees Celsius (9 keV). We are now operating Polaris, our next prototype on the path to the world's first fusion power plant.

This is a pivotal time to join Helion. You will tackle real-world challenges with a team that prizes urgency, rigor, ownership, and a commitment to delivering hard truths, values essential to achieving what no one has before. Together, we will change the future of energy, because the world can't wait.

The Senior Site Reliability Engineer is a strategic technical leader responsible for designing and maintaining resilient systems and infrastructure. This role involves proactive reliability engineering, incident response leadership, and mentoring junior SREs to uphold high operational standards across the organization. This is an onsite role that reports directly to the Director of IS&T at our Everett, WA office.

Requirements

8+ years of experience in SRE, DevOps, or infrastructure engineering roles
Bachelor's or master's degree in computer science, engineering, or a related field
Technical Proficiency: Advanced knowledge of cloud platforms (AWS, GCP, Azure), container orchestration (Kubernetes), and scripting languages (Python, Go, Bash)
Infrastructure Expertise: Deep understanding of distributed systems, networking, and Linux internals
Problem Solving: Strong analytical skills for diagnosing complex system failures and performance bottlenecks
Collaboration: Excellent communication and cross-functional teamwork abilities

Responsibilities

Collaborate with engineering teams to design scalable, fault-tolerant systems that meet performance and reliability goals
Define and manage SLIs, SLOs, and SLAs; implement error budgets and reliability metrics
Lead major incident responses, conduct root cause analyses, and drive postmortem processes
Build and maintain automation for deployments, monitoring, and infrastructure management using tools like Terraform, Kubernetes, and CI/CD pipelines
Develop and maintain observability platforms to ensure real-time system health tracking and proactive alerting
Forecast system demands and optimize performance through load testing and tuning
Collaborate with security teams to ensure infrastructure meets compliance and security standards
Guide junior engineers, promote best practices, and contribute to a culture of reliability and continuous improvement