Site Reliability Engineer

PsiQuantum•Palo Alto, CA

213d•$120,000 - $140,000

About The Position

Join the OS/Platform team as a Site Reliability Engineer (SRE) and keep our services healthy, observable, and fast. Partnering with the Platform Engineering group, you’ll own the day‑to‑day operation of our monitoring stack—Grafana, Prometheus, Loki, and Tempo—crafting dashboards that surface golden signals and drive real‑time insight. You’ll codify reliability through SLIs/SLOs, automate runbooks in Python, and lead incident response to maintain world‑class uptime across both on‑prem and AWS environments.

Requirements

Bachelor’s degree or higher in Computer Science, Engineering, or a related technical field.
5+ years in an SRE, DevOps, or Production Engineering role supporting distributed systems in production.
Hands‑on expertise with observability tools: Grafana, Prometheus, Loki, Tempo (or equivalent).
Proven track record of designing dashboards and alerts around golden signals and the USE (Utilization, Saturation, Errors) and RED (Rate, Errors, Duration) methodologies.
Solid scripting/automation skills in Python and Bash; familiarity with GitLab CI pipelines.
Operational experience with Kubernetes and containerized workloads.
Working knowledge of AWS services, networking fundamentals, and load balancing.
Experience running incident response and writing actionable post‑mortems.
Familiarity with Infrastructure as Code (Terraform, Ansible) and configuration management.
Strong communication and collaboration skills; comfortable acting as a generalist across infrastructure, application, and data layers.

Nice To Haves

Exposure to regulated environments and multi‑region architectures is a plus.

Responsibilities

Define, implement, and iterate on Service Level Indicators & Service Level Objectives (SLIs/SLOs) and error budgets for critical services.
Build and maintain Grafana dashboards that visualize golden signals (latency, traffic, errors, saturation) for engineers and stakeholders.
Operate and tune our observability pipeline (Prometheus, Loki, Tempo) to ensure scalable, low‑latency telemetry ingestion and alerting.
Drive incident response: triage, mitigate, perform post‑incident reviews, and implement preventive actions.
Develop automation and self‑service tooling in Python/Bash to streamline alerts, runbooks, and operational tasks.
Collaborate with Platform and Product teams on capacity planning, performance testing, and change management.
Improve CI/CD health checks and release safety nets within GitLab.
Contribute to infrastructure as code (Terraform, Ansible) for monitoring stack deployments and upgrades.

Benefits

Equity and benefits eligibility for full-time roles.

What This Job Offers

Job Type

Full-time

Career Level

Mid Level

Education Level

Bachelor's degree

Number of Employees

251-500 employees
