PsiQuantum, posted 3 months ago
Palo Alto, CA
251-500 employees
Computer and Electronic Product Manufacturing

Join the OS/Platform team as a Site Reliability Engineer (SRE) and keep our services healthy, observable, and fast. Partnering with the Platform Engineering group, you'll own the day-to-day operation of our monitoring stack (Grafana, Prometheus, Loki, and Tempo), crafting dashboards that surface golden signals and drive real-time insight. You'll codify reliability through SLIs/SLOs, automate runbooks in Python, and lead incident response to maintain world-class uptime across both on-prem and AWS environments.

  • Define, implement, and iterate on Service Level Indicators & Service Level Objectives (SLIs/SLOs) and error budgets for critical services, with a focus on network reliability and data centre interconnects.
  • Build and maintain Grafana dashboards that visualize golden signals (latency, traffic, errors, saturation), extending coverage to network telemetry such as packet loss, jitter, bandwidth utilization, and BGP/EVPN stability.
  • Operate and tune the observability pipeline (Prometheus, Loki, Tempo) to ensure scalable, low-latency telemetry ingestion and alerting for networking as well as compute layers.
  • Drive incident response: triage, mitigate, perform post-incident reviews, and implement preventive actions, particularly for network-related outages, congestion, or misconfigurations.
  • Develop automation and self-service tooling in Python/Bash to streamline alerts, runbooks, and operational tasks, including network monitoring and diagnostics.
  • Collaborate with Platform, Product, and Networking teams on capacity planning, performance testing, traffic engineering, and change management.
  • Improve CI/CD health checks and release safety nets within GitLab, with attention to network dependencies in deployments.
  • Contribute to Infrastructure as Code (Terraform, Ansible) for monitoring stack deployments and upgrades, including network observability tooling and configuration.
  • Bachelor's Degree or higher in Computer Science, Engineering, or related technical field.
  • 5+ years in an SRE, DevOps, or Production Engineering role supporting distributed systems in production.
  • Hands-on expertise with observability tools: Grafana, Prometheus, Loki, Tempo (or equivalent).
  • Proven track record designing dashboards and alerts around golden signals and USE/RED methodologies, extended to network utilization, saturation, and error metrics.
  • Solid scripting/automation skills in Python and Bash; familiarity with GitLab CI pipelines.
  • Operational experience with Kubernetes and containerized workloads.
  • Strong working knowledge of AWS services, data centre networking fundamentals, routing protocols, load balancing, and network overlays (e.g., VXLAN/EVPN).
  • Experience running incident response and writing actionable post-mortems, including for network-related events.
  • Familiarity with Infrastructure as Code (Terraform, Ansible) and configuration management.
  • Strong communication and collaboration skills; comfortable acting as a generalist across infrastructure, networking, application, and data layers.
  • Exposure to regulated environments, multi-region networking architectures, and hybrid on-prem/cloud topologies.