Staff Site Reliability Engineer - Spacetime

Aalyria

Hybrid

About The Position

This isn't a "keep the lights on" SRE role. It is a strategic, high-impact opportunity to build the nervous system for a platform that transforms how networks of satellites, ground stations, and fleets are interconnected and orchestrated. You will build the core observability stack that ensures the reliability of systems critical to operating satellite megaconstellations and deep-space missions. The work is part greenfield, part brownfield: you will be the foundational expert, defining the strategy and building the tools that empower our engineers. You will own the roadmap to mature our observability stack into a robust, scalable, and insightful platform based on best-in-class technologies (e.g., Prometheus and OpenTelemetry). If you are an SRE who thrives on platform-building challenges and wants to own a production-grade observability stack from the ground up, this role is for you. Note: this role includes on-call responsibilities.

Requirements

7+ years of experience in an SRE or platform engineering role, with a focus on observability for large-scale, distributed compute or network systems.
Deep, hands-on expertise building, scaling, and managing observability platforms (e.g., Prometheus, Grafana, Loki/ELK, OpenTelemetry, Tempo/Jaeger, Honeycomb). You have proven experience using these tools to support performance analysis and debugging of complex distributed systems.
Strong production-level experience with Google Cloud Platform (GCP) and Kubernetes.
Proven mastery of Infrastructure as Code (IaC) with Terraform and GitOps principles (e.g., ArgoCD).
Proficiency in at least one programming language for debugging and writing tooling, with a strong preference for Go and Python.
Demonstrable experience defining, implementing, and managing SLOs, SLIs, and error budgets for production services.

Nice To Haves

Experience operating a multi-cloud environment, specifically GCP and AWS.
Hands-on experience with GitLab CI for CI/CD pipelines.
Working knowledge of service mesh technologies such as Istio or Linkerd.
Experience with high-performance computing (HPC) environments and instrumenting numerical optimization workloads.
Familiarity with instrumenting applications written in Go and C++.
Experience with JVM observability (tuning, monitoring) for Java-based applications.

Responsibilities

Design, build, and own the technical roadmap for Aalyria's centralized observability platform, integrating and scaling tools for metrics (Prometheus), logging (Loki), and distributed tracing (Tempo/OpenTelemetry).
Define, implement, and manage a robust framework of Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets for our core products, ensuring we are launch-ready.
Establish and evangelize observability best practices, providing standards, documentation, and tooling (e.g., OpenTelemetry libraries) to empower our Go and Java application teams to instrument their services effectively.
Partner with core software engineers to provide the tools and insights needed to debug performance, optimize computational pipelines (including CPU/GPU workloads), and ensure the reliability of large-scale distributed systems.
Automate the deployment, scaling, and management of the entire observability stack using Infrastructure as Code (Terraform) and GitOps principles (ArgoCD).
Partner closely with the core infrastructure team to ensure deep visibility into our Kubernetes clusters and underlying GCP and AWS environments.
Develop and lead the company's monitoring, alerting, and incident response strategy, driving a culture of proactive reliability and blameless post-mortems.