Site Reliability Engineer

PsiQuantum, posted 3 months ago

Palo Alto, CA

251-500 employees

Computer and Electronic Product Manufacturing

Join the OS/Platform team as a Site Reliability Engineer (SRE) and keep our services healthy, observable, and fast. Partnering with the Platform Engineering group, you'll own the day-to-day operation of our monitoring stack (Grafana, Prometheus, Loki, and Tempo), crafting dashboards that surface golden signals and drive real-time insight. You'll codify reliability through SLIs/SLOs, automate runbooks in Python, and lead incident response to maintain world-class uptime across both on-prem and AWS environments.

Define, implement, and iterate on Service Level Indicators & Service Level Objectives (SLIs/SLOs) and error budgets for critical services, with a focus on network reliability and data centre interconnects.
Build and maintain Grafana dashboards that visualize golden signals (latency, traffic, errors, saturation), extending coverage to network telemetry such as packet loss, jitter, bandwidth utilization, and BGP/EVPN stability.
Operate and tune the observability pipeline (Prometheus, Loki, Tempo) to ensure scalable, low-latency telemetry ingestion and alerting for networking as well as compute layers.
Drive incident response: triage, mitigate, perform post-incident reviews, and implement preventive actions, particularly for network-related outages, congestion, or misconfigurations.
Develop automation and self-service tooling in Python/Bash to streamline alerts, runbooks, and operational tasks, including network monitoring and diagnostics.
Collaborate with Platform, Product, and Networking teams on capacity planning, performance testing, traffic engineering, and change management.
Improve CI/CD health checks and release safety nets within GitLab, with attention to network dependencies in deployments.
Contribute to Infrastructure as Code (Terraform, Ansible) for monitoring stack deployments and upgrades, including network observability tooling and configuration.

Bachelor's Degree or higher in Computer Science, Engineering, or related technical field.
5+ years in an SRE, DevOps, or Production Engineering role supporting distributed systems in production.
Hands-on expertise with observability tools: Grafana, Prometheus, Loki, Tempo (or equivalent).
Proven track record designing dashboards and alerts around golden signals and USE/RED methodologies, extended to network utilization, saturation, and error metrics.
Solid scripting/automation skills in Python and Bash; familiarity with GitLab CI pipelines.
Operational experience with Kubernetes and containerized workloads.
Strong working knowledge of AWS services, data centre networking fundamentals, routing protocols, load balancing, and network overlays (e.g., VXLAN/EVPN).
Experience running incident response and writing actionable post-mortems, including for network-related events.
Familiarity with Infrastructure as Code (Terraform, Ansible) and configuration management.
Strong communication and collaboration skills; comfortable acting as a generalist across infrastructure, networking, application, and data layers.

Exposure to regulated environments, multi-region networking architectures, and hybrid on-prem/cloud topologies.
