Senior Site Reliability Engineer

D-Wave
$124,364 - $185,545

About The Position

We are seeking a talented and experienced Senior Site Reliability Engineer (SRE) to join our DevOps team. As a key member of the team, you will be responsible for the reliability of our SaaS product, our research laboratory, and the infrastructure supporting our production quantum computers worldwide. You will play a critical role in ensuring the reliability, scalability, and performance of our company’s systems and infrastructure. The ideal candidate will have a strong background in systems administration, automation and troubleshooting complex distributed systems.

Requirements

  • 4+ years of experience operating and troubleshooting SaaS/PaaS applications and environments on a major cloud platform – AWS and GCP preferred – including platform-specific monitoring technologies like CloudWatch and Stackdriver
  • 4+ years of experience with high-level SRE work, including incident management, process design, managing on-call rotations (with PagerDuty), and cross-training new and existing employees
  • Experience with on-premises compute, including servers, storage, power, virtualization, and networking equipment, and specifically with using SNMP to monitor networked devices
  • 4+ years of experience with AOS/Elasticsearch/Loki or similar log management tools
  • Experience with time series databases like Prometheus/InfluxDB, document stores like MongoDB, and classic relational databases like PostgreSQL, AWS Redshift, etc.
  • Proficiency in InfluxQL and PromQL
  • Significant expertise supporting and integrating analytics and monitoring systems such as ELK, Grafana, Prometheus, Zabbix, LibreNMS, Intermapper, etc.
  • At least two years of programming experience in Python, Go, Bash, Ruby, or equivalent
  • Degree in Computing Science, Engineering or equivalent education and experience
  • Excellent oral and written communication skills – you like to document your work!

Nice To Haves

  • 3+ years of specific experience with Elasticsearch / AWS OpenSearch, Fluent, Grafana Cloud
  • Experience with Kubernetes monitoring
  • Experience with producing synthetic metrics and instrumenting existing applications and platforms to extract metrics for analysis
  • Experience with OpenTelemetry
  • Proven record of cross-training and evangelizing observability as a critical aspect of all systems

Responsibilities

  • Refine, refactor, and evolve monitoring systems and related tools covering our workloads in AWS, GCP, on-premises, and remote field systems across the world
  • Work with teams including software and hardware engineering, processor development, cryogenics, and customer support to elicit requirements, collect and store metrics, analyze trends, and provide dashboards and other tooling to enable observability across the organization
  • Own alerting together with other SREs to support infrastructure and on-call management systems, and ensure alerting is reliable and scalable
  • Work closely with the DevOps and Test Engineering teams to instrument builds and deploys, ensuring reliability through every step of the software development lifecycle

Benefits

  • Company ownership
  • Competitive pay
  • Range of meaningful benefits