Site Reliability Engineer, Kubernetes Platform (Starshield)

SpaceX•Hawthorne, CA

3d•$125,000 - $175,000•Onsite

About The Position

At SpaceX, we’re leveraging our experience in building rockets and spacecraft to deploy the Starshield constellation. Starshield is the world’s largest US government satellite constellation and is tasked with providing immediate access to critical intelligence and national security data for the US government anywhere on the globe. We design, build, test, and operate all parts of the system: receivers that allow users to connect within minutes and the software that brings it all together. We’ve only begun to scratch the surface of Starshield’s global impact and are looking for best-in-class engineers to help us further our ambitious goals.

As an engineer focused on Starshield’s software and network infrastructure, you will design, operate, and scale the infrastructure we use to run the world’s largest government satellite constellation. These positions span a variety of areas, including Site Reliability Engineering, Developer Operations, and our internal Kubernetes platforms. You will develop automation to deploy and manage on-premises compute resources, create highly scalable and maintainable software products, and collaborate directly with engineering across the board.

Requirements

Bachelor’s degree in computer science, information systems/IT, or an engineering discipline and 1+ years of professional experience in site reliability engineering or DevOps; OR 3+ years of professional experience in site reliability engineering or DevOps in lieu of a degree
1+ years of professional experience with Linux operating systems
Experience with Terraform, Ansible, or other infrastructure tools
Experience with containerization technologies (e.g., OCI containers, Kubernetes)
Experience scripting in Bash, Python, or other similar languages
Development experience in Python, C++, or Go

Nice To Haves

1+ years of experience with Python and Python-based development frameworks
Experience managing Kubernetes clusters, not just using them
Knowledge of Linux boot process and systems configuration
Deep understanding of testing, continuous integration, build, deployment, and continuous monitoring
Understanding of relevant build technologies, such as Bazel and Makefiles
Focus on performance bottlenecks and performance improvement techniques
Understanding of distributed databases and data modeling
Experience with automatically managing dozens, hundreds, or thousands of servers (e.g., with Terraform or Ansible)
Strong networking knowledge of TCP/IP
Excellent communication skills, with the ability to communicate with customers, peers, and management in both formal and informal situations
Active Top Secret, Top Secret SCI, or DOE Level Q clearance

Responsibilities

Develop automation to deploy and manage on-premises Kubernetes clusters
Deploy and manage core infrastructure such as databases, monitoring and distributed storage
Closely collaborate with software engineers to create highly scalable, operable, and maintainable products
Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement
Build monitoring and alerting so that supporting systems maintain high availability
Perform hands-on integration and troubleshooting across the entire Starshield stack
Identify areas for improvement and create innovative solutions that enable high system availability