Site Reliability Engineer, GNC

SpaceX•Hawthorne, CA

57d•Onsite

About The Position

SpaceX’s mission is to make humanity multiplanetary by developing fully and rapidly reusable launch systems capable of launching Starship multiple times per day while continuing to scale the Starlink constellation. To support these goals, we are seeking a Site Reliability Engineer to operate and scale custom-built, mission-critical products for the Guidance, Navigation, and Control (GNC) teams.

GNC teams at SpaceX are responsible for vehicle design, trajectory design and optimization, high-fidelity vehicle simulation, and software and control algorithm development. They also support both launch and on-orbit operations across multiple vehicle programs. In this role, you will work closely with GNC teams across SpaceX to maintain and improve a suite of critical GNC-focused tools and infrastructure that must scale reliably to enable a multiplanetary future. These systems include on-prem services, large-scale Monte Carlo simulations on our high-performance computing (HPC) cluster, automated data analysis pipelines, continuous integration systems for rocket and simulation software, GNC analysis infrastructure, and vehicle configuration verification tools.

The ideal candidate is flexible, possesses broad skills spanning product operations and software development, and thrives in a fast-paced, high-impact environment.

Requirements

Bachelor's degree in computer science, information systems/IT, engineering, math, or a scientific discipline and 2+ years of software development experience, OR 4+ years of professional experience building software in a site reliability or DevOps capacity in lieu of a degree
Experience with Linux operating systems
Experience with Python and Python-based development frameworks

Nice To Haves

2+ years of systems administration, site reliability engineering, or DevOps experience
2+ years of experience with Python and Python-based development frameworks
2+ years of Linux experience
Expertise with Docker, Vagrant, and Kubernetes or similar technologies
Extensive experience with configuration management tools such as Ansible, Puppet, or Terraform
Experience with build systems (Make, Bazel / Pants / Buck, Gradle) and package management tools (pip, npm)
Strong understanding of virtualization and hypervisor technologies
Understanding of databases and data modeling
Experience with automatically managing dozens or hundreds of servers
Strong networking knowledge of TCP/IP
Experience scaling web applications and optimizing applications for performance
Experience with managing on-prem infrastructure, including direct experience managing GPU fleets
Experience with high-performance computing systems or large-scale data analysis systems
Must be comfortable working with mission-critical and sensitive systems, with a sense of urgency appropriate to the responsibilities
Ability and willingness to obtain a Top Secret clearance

Responsibilities

Deploy, upgrade, operate, and scale a suite of mission-critical GNC products and services
Provision and maintain virtual and physical servers
Work with the SpaceX HPC team to monitor and maintain an HPC cluster consisting of tens of thousands of CPUs
Closely collaborate with GNC software engineers to create highly operable and maintainable products
Monitor web applications and services and respond to incidents
Manage the underlying computational infrastructure of GNC in collaboration with IT stakeholders
Engage in and improve the whole lifecycle of services from whiteboard to operational
Make data-driven recommendations for future hardware purchases
Practice sustainable incident response and postmortems
Provide end-user support to GNC engineering by becoming an expert on analysis applications, helping users troubleshoot, and pointing them to relevant features
Configure automated deployment pipelines for web apps
Develop or improve GNC web apps and tools for better usability, maintainability, and robustness
Demo and document new software changes such as operating system upgrades, shared filesystem changes, or major tool rollouts
Focus on performance bottlenecks and performance improvement techniques

Benefits

long-term incentives, in the form of company stock, stock options, or long-term cash awards
potential discretionary bonuses
ability to purchase additional stock at a discount through an Employee Stock Purchase Plan
comprehensive medical, vision, and dental coverage
access to a 401(k) retirement plan
short- and long-term disability insurance
life insurance
paid parental leave
various other discounts and perks
3 weeks of paid vacation
10 or more paid holidays per year
paid sick leave