Site Reliability Engineer — HPC & Automation (Silicon Engineering)

SpaceX • Redmond, WA

17h • $125,000 - $175,000 • Onsite

About The Position

As a Site Reliability Engineer on the Silicon Engineering team, you will design, operate, scale, and automate the high performance computing infrastructure we use to develop the chips powering the world's largest satellite constellation and a global internet service. This position will have a meaningful impact on Starlink silicon by shortening the design iteration, simulation, and regression turnaround times that gate how fast our chip teams can ship.

Requirements

Bachelor’s degree in computer science, information systems, or an engineering discipline; OR 2+ years of professional experience in system administration, high performance computing, or site reliability engineering
1+ years of development experience with Bash, Python, and/or other programming languages
1+ years of experience with Linux operating systems

Nice To Haves

Familiarity with containerization technologies (e.g., Docker, Kubernetes)
Knowledge of computer systems concepts (computer architecture, computer organization, operating systems, and concurrency)
Experience with databases and data modeling (e.g., MySQL, PostgreSQL, SQLite)
Knowledge of TCP/IP networking
Experience with high performance computing and workload managers (e.g., Slurm, LSF)
Experience with Terraform, Ansible, Puppet, or similar automation frameworks
Experience building monitoring and alerting as code (e.g., Grafana, Prometheus, custom exporters)
Experience with CI/CD automation at scale (e.g., Jenkins, Bamboo, build systems)
Experience with infrastructure as code (IaC) tools for managing fleets of servers
Experience using and building REST API clients and servers
Experience with enterprise/networked storage automation (e.g., NetApp ONTAP REST API/CLI, NFS)
Experience with ASIC design flows and tools (e.g., Cadence, Synopsys, Ansys, Keysight, Siemens)
Strong desire to find performance bottlenecks and apply techniques to eliminate them
Excellent communication skills, with the ability to communicate with customers, peers, and management in both formal and informal situations
Ability to quickly learn new tools and frameworks
Interest in or experience with AI/LLM-assisted tooling (e.g., Grok, Claude Code)

Responsibilities

Deploy, upgrade, operate, maintain, and scale our suite of clusters and services
Collaborate with engineers to develop automated, fully turnkey solutions for silicon simulation workflows that shorten project timelines
Manage our underlying infrastructure as code and use modern observability tools to provide a complete picture of cluster and infrastructure health
Operate the continuous integration pipeline, build and release systems, and version control across the environment
Identify and eliminate performance bottlenecks using measurement and creative engineering

Benefits

long-term incentives, in the form of company stock or long-term cash awards
potential discretionary bonuses
ability to purchase additional stock at a discount through an Employee Stock Purchase Plan
comprehensive medical, vision, and dental coverage
access to a 401(k) retirement plan
short- and long-term disability insurance
life insurance
paid parental leave
various other discounts and perks
3 weeks of paid vacation
10 or more paid holidays per year
paid sick time in compliance with state and federal law (for Washington State employees)
company shuttles for roundtrip travel between select Seattle locations and the SpaceX Redmond office, Monday to Friday