Controls and Systems Software Engineer

General Matter • Los Angeles, CA

17d • $100,000 - $200,000

About The Position

We are seeking a highly capable DevOps / Site Reliability Engineer to help build and operate the software systems underpinning uranium enrichment R&D and production infrastructure. This role is foundational to our reliability, safety, and developer velocity. You will be responsible for designing and maintaining observability, alerting, and developer productivity systems, and for ensuring that critical internal and production services are correctly instrumented and monitored. We are only interested in candidates with strong fundamentals, sound judgment, and the ability to operate with rigor in a production environment where failures matter.

Requirements

Strong fundamentals in web service development and distributed systems
Solid understanding of networking concepts, DNS, TLS/certificate management, and HTTP
Experience operating and debugging production systems
Familiarity with observability tools (metrics, logging, alerting) and incident response
Ability to write clear, maintainable code and automation scripts
Demonstrated ownership, attention to detail, and sound technical judgment

Nice To Haves

Experience with modern observability stacks (e.g., Prometheus, Grafana, OpenTelemetry, Datadog)
Hands-on experience with cloud infrastructure and infrastructure-as-code
Exposure to CI/CD pipelines and developer tooling at scale
Experience supporting safety-critical or high-reliability systems
Strong debugging skills across application, OS, and network boundaries
Prior on-call experience in a production environment

Responsibilities

Design, implement, and maintain observability and alerting systems across critical services and infrastructure
Ensure all production and internal services are properly instrumented with metrics, logs, and traces
Own and maintain developer productivity tools, CI/CD systems, and internal platforms
Participate in an on-call rotation and respond to production incidents with urgency and discipline
Lead incident reviews and drive long-term reliability improvements
Automate operational workflows to reduce manual toil and improve system resilience