Site Reliability Engineer — HPC & Automation (Silicon Engineering)

SpaceX, Redmond, WA
$125,000 - $175,000, Onsite

About The Position

As a Site Reliability Engineer on the Silicon Engineering team, you will design, operate, scale, and automate the high performance computing infrastructure we use to develop the chips powering the world's largest satellite constellation and a global internet service. This position will have a meaningful impact on Starlink silicon by speeding up the design iterations, simulations, and regression turnaround times that gate how fast our chip teams can ship.

Requirements

  • Bachelor’s degree in computer science, information systems, or an engineering discipline; OR 2+ years of professional experience in system administration, high performance computing, or site reliability engineering
  • 1+ years of development experience with Bash, Python, and/or other programming languages
  • 1+ years of experience with Linux operating systems

Nice To Haves

  • Familiarity with containerization technologies (e.g., Docker, Kubernetes)
  • Knowledge of computer system concepts (computer architecture, computer organization, operating systems, and concurrency)
  • Experience with databases and data modeling (e.g., MySQL, PostgreSQL, SQLite)
  • Networking knowledge of TCP/IP
  • Experience with high performance computing and workload managers (e.g., Slurm, LSF)
  • Experience with Terraform, Ansible, Puppet, or similar automation frameworks
  • Experience building monitoring and alerting as code (e.g., Grafana, Prometheus, custom exporters)
  • Experience with CI/CD automation at scale (e.g., Jenkins, Bamboo, build systems)
  • Experience with infrastructure as code (IaC) tools for managing fleets of servers
  • Experience using and building REST API clients/servers
  • Experience with enterprise/networked storage automation (e.g., NetApp ONTAP REST API/CLI, NFS)
  • Experience with ASIC design flows and tools (e.g., Cadence, Synopsys, Ansys, Keysight, Siemens)
  • Strong desire to find performance bottlenecks and apply performance improvement techniques
  • Excellent communication skills with the ability to communicate with customers, peers, management, etc. in both formal and informal situations
  • Ability to quickly learn new tools and frameworks
  • Interest in or experience with AI/LLM-assisted tooling (e.g., Grok, Claude Code)

Responsibilities

  • Deploy, upgrade, operate, maintain, and scale our suite of clusters and services
  • Collaborate with engineers to develop automated, fully turnkey solutions for silicon simulation workflows to speed up project timelines
  • Manage our underlying infrastructure as code and use modern observability tools to provide a complete picture of cluster and infrastructure health
  • Operate the continuous integration pipeline, build and release systems, and version control across the environment
  • Identify and eliminate performance bottlenecks using measurement and creative engineering

Benefits

  • long-term incentives, in the form of company stock or long-term cash awards
  • potential discretionary bonuses
  • ability to purchase additional stock at a discount through an Employee Stock Purchase Plan
  • comprehensive medical, vision, and dental coverage
  • access to a 401(k) retirement plan
  • short- and long-term disability insurance
  • life insurance
  • paid parental leave
  • various other discounts and perks
  • 3 weeks of paid vacation
  • 10 or more paid holidays per year
  • paid sick time in compliance with state and federal law (for Washington State employees)
  • company shuttles for roundtrip travel between select Seattle locations and the SpaceX Redmond office, Monday to Friday