Site Reliability Engineer (Top Secret Clearance)

SpaceX • Hawthorne, CA

$145,000 - $175,000

About The Position

As a member of the Classified IT Systems Engineering team, the Site Reliability Engineer helps design scalable systems capable of supporting a growing volume of data products generated en masse. We build tools that enable us to work more efficiently and that help us build software systems that are secure, reliable, and autonomous. Our engineers are responsible for the full life cycle of the systems they create, including development, testing, and operational support.

Requirements

Bachelor’s degree in computer science, information systems/IT, or an engineering discipline; OR 2+ years of professional experience in software, DevOps, or site reliability engineering in lieu of a degree
1+ year of experience with Kubernetes
1+ year of experience with Linux operating systems
Experience in Bash, Python, and/or other scripting languages
Experience building, maintaining, and scaling on-premises and/or cloud systems

Nice To Haves

Active Top Secret, Top Secret SCI, or DOE Level Q clearance is highly desired
Experience hosting and pushing the state of the art in inferential model benchmarks
Experience with systems administration, site reliability engineering, or DevOps engineering
Experience with Python and Python-based development frameworks
Experience with virtualization and hypervisor technologies
Experience with automatically managing dozens or hundreds of servers
Knowledge of performance bottlenecks and performance improvement techniques
Excellent communication skills, with the ability to communicate with customers, peers, and management in both formal and informal situations
Ability to quickly learn new tools and frameworks

Responsibilities

Develop automation to deploy and manage compute resources both on-premises and in the cloud
Build, maintain, and scale on-premises hardware systems designed to host GPU-accelerated machine learning workloads
Deploy and manage core infrastructure such as databases, monitoring, and storage
Collaborate closely with software engineers to create highly scalable, operable, and maintainable products
Engage in and improve the whole life cycle of services, from inception and design through deployment, operation, and refinement