PsiQuantum • Posted 5 months ago
$120,000 - $140,000/Yr
Full-time • Mid Level
Palo Alto, CA
251-500 employees

Join the OS/Platform team as a Site Reliability Engineer (SRE) and keep our services healthy, observable, and fast. Partnering with the Platform Engineering group, you’ll own the day‑to‑day operation of our monitoring stack—Grafana, Prometheus, Loki, and Tempo—crafting dashboards that surface golden signals and drive real‑time insight. You’ll codify reliability through SLIs/SLOs, automate runbooks in Python, and lead incident response to maintain world‑class uptime across both on‑prem and AWS environments.

  • Define, implement, and iterate on Service Level Indicators & Service Level Objectives (SLIs/SLOs) and error budgets for critical services.
  • Build and maintain Grafana dashboards that visualize golden signals (latency, traffic, errors, saturation) for engineers and stakeholders.
  • Operate and tune our observability pipeline (Prometheus, Loki, Tempo) to ensure scalable, low‑latency telemetry ingestion and alerting.
  • Drive incident response: triage, mitigate, perform post‑incident reviews, and implement preventive actions.
  • Develop automation and self‑service tooling in Python/Bash to streamline alerts, runbooks, and operational tasks.
  • Collaborate with Platform and Product teams on capacity planning, performance testing, and change management.
  • Improve CI/CD health checks and release safety nets within GitLab.
  • Contribute to infrastructure as code (Terraform, Ansible) for monitoring stack deployments and upgrades.
  • Bachelor’s degree or higher in Computer Science, Engineering, or a related technical field.
  • 5+ years in an SRE, DevOps, or Production Engineering role supporting distributed systems in production.
  • Hands‑on expertise with observability tools: Grafana, Prometheus, Loki, Tempo (or equivalent).
  • Proven track record designing dashboards and alerts around golden signals and the USE (Utilization, Saturation, Errors) and RED (Rate, Errors, Duration) methodologies.
  • Solid scripting/automation skills in Python and Bash; familiarity with GitLab CI pipelines.
  • Operational experience with Kubernetes and containerized workloads.
  • Working knowledge of AWS services, networking fundamentals, and load balancing.
  • Experience running incident response and writing actionable post‑mortems.
  • Familiarity with Infrastructure as Code (Terraform, Ansible) and configuration management.
  • Exposure to regulated environments and multi‑region architectures is a plus.
  • Strong communication and collaboration skills; comfortable acting as a generalist across infrastructure, application, and data layers.
  • Equity and benefits eligibility for full-time roles.