Senior Staff Site Reliability Engineer

Archer•San Jose, CA

46d•Hybrid

About The Position

Archer is an aerospace company based in San Jose, California building an all-electric vertical takeoff and landing aircraft with a mission to advance the benefits of sustainable air mobility. We are designing, manufacturing, and operating an all-electric aircraft that can carry four passengers while producing minimal noise. Our sights are set high and our problems are hard, and we believe that diversity in the workplace is what makes us smarter, drives better insights, and will ultimately lift us all to success. We are dedicated to cultivating an equitable and inclusive environment that embraces our differences, and supports and celebrates all of our team members.

Staff Site Reliability Engineer (Hybrid - San Jose, CA)

The Role

We are looking for a Staff Site Reliability Engineer to join our SRE organization as a technical specialist. In this role, you will focus on engineering the "glue" that makes our systems resilient: building custom internal tools, refining our observability stack, and ensuring our SLO/SLI frameworks are technically sound. You will work alongside our existing SRE teams to convert manual operational tasks into automated, programmable infrastructure.

Requirements

Experience: 8+ years in SRE, Production Engineering, or high-scale DevOps environments.
Advanced Programming: Strong software engineering fundamentals. You should be as comfortable writing a microservice or a custom API as you are configuring a load balancer.
Metric Logic: Deep understanding of how to derive meaningful SLIs from complex distributed systems (latency percentiles, success rates, etc.).
Orchestration: Expert-level Kubernetes (EKS, GKE, or self-managed).
IaC: Advanced Terraform or Pulumi.
Observability: Mastery of the "Three Pillars" (Logs, Metrics, Traces) using tools like Prometheus, Jaeger, or Datadog.
Communication: Ability to distill complex system health data into clear, actionable reports for both engineers and business leadership.

Responsibilities

Standardize SRE Procedures: Design and implement consistent technical procedures for incident response, error budget tracking, and production readiness across our service catalog.
Engineer SLOs & SLIs: Technically instrument our services to capture precise SLIs. You will be responsible for the backend logic that calculates Error Budgets and triggers automated alerts based on burn rates.
Build Special Purpose Tooling: Write production-grade code (Go, Python, etc.) to build internal tools that solve specific infrastructure gaps, such as custom Kubernetes operators, automated remediation scripts, or deployment safety gates.
Executive & Operational Dashboards: Create a unified observability layer. This includes deep-dive Grafana/Datadog dashboards for real-time debugging and high-level, aggregate views for executive stakeholders to monitor SLA compliance.
Toil Reduction: Identify repetitive operational tasks across the group and build the automation necessary to eliminate them.
Collaborative Engineering: Work within the SRE group to provide expert-level code reviews for infrastructure changes and contribute to the collective codebase of our internal platforms.
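
For candidates less familiar with the error budget and burn rate terms used above, the arithmetic can be sketched as follows. This is an illustrative sketch only, not Archer's implementation: the `SloWindow` type, the function names, and the two-window paging rule are hypothetical, and the 14.4 threshold is an assumption borrowed from commonly cited multi-window alerting guidance (a burn rate of 14.4 sustained for one hour consumes about 2% of a 30-day budget).

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts observed over one evaluation window (hypothetical shape)."""
    total: int
    failed: int


def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(window: SloWindow, slo_target: float) -> float:
    """Observed error rate divided by the error budget.

    A value of 1.0 means the budget is being spent exactly as fast as the
    SLO allows; higher values mean it will run out early.
    """
    if window.total == 0:
        return 0.0  # no traffic, so no budget was spent
    return (window.failed / window.total) / error_budget(slo_target)


def should_page(long_window: SloWindow, short_window: SloWindow,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both the long and the short window exceed the threshold.

    Requiring both windows is an assumed design choice: the long window
    shows the burn is significant, the short one shows it is still ongoing.
    """
    return (burn_rate(long_window, slo_target) >= threshold
            and burn_rate(short_window, slo_target) >= threshold)
```

With a 99.9% target, a window of 1,000 requests with 20 failures has an error rate of 2% against a budget of 0.1%, which is a burn rate of roughly 20. Under the assumed threshold that would page, provided the short window is also burning above it.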
