Senior Site Reliability Engineer

D-Wave

Posted 130 days ago • $124,364 - $185,545

About The Position

We are seeking a talented and experienced Senior Site Reliability Engineer (SRE) to join our DevOps team. As a key member of the team, you will be responsible for the reliability of our SaaS product, our research laboratory, and the infrastructure supporting our production quantum computers worldwide. You will play a critical role in ensuring the reliability, scalability, and performance of our company’s systems and infrastructure. The ideal candidate will have a strong background in systems administration, automation, and troubleshooting complex distributed systems.

Requirements

4+ years of experience operating and troubleshooting SaaS/PaaS applications and environments on a major cloud platform – AWS and GCP preferred – including platform-specific monitoring technologies like CloudWatch and Stackdriver
4+ years of experience with high-level SRE work including incident management, process design, managing on-call rotations (with PagerDuty), and cross-training new and existing employees
Experience with on-premises compute, including servers, storage, power, virtualization, and networking equipment, and specifically with using SNMP to monitor networked devices
4+ years of experience with AOS/Elasticsearch/Loki or similar log management tools
Experience with time series databases like Prometheus/InfluxDB, document stores like MongoDB, and classic relational databases like PostgreSQL, AWS Redshift, etc.
Proficiency in InfluxQL and PromQL
Significant expertise supporting and integrating analytics and monitoring systems such as ELK, Grafana, Prometheus, Zabbix, LibreNMS, Intermapper, etc.
At least two years of programming experience in Python, Go, Bash, Ruby, or equivalent
Degree in Computing Science, Engineering or equivalent education and experience
Excellent oral and written communication skills – you like to document your work!

Nice To Haves

3+ years of specific experience with Elasticsearch / AWS OpenSearch, Fluent, Grafana Cloud
Experience with Kubernetes monitoring
Experience with producing synthetic metrics and instrumenting existing applications and platforms to extract metrics for analysis
Experience with OpenTelemetry
Proven record of cross-training and evangelizing observability as a critical aspect of all systems

Responsibilities

Refine, refactor, and evolve monitoring systems and related tools covering our workloads in AWS, GCP, on-premises, and remote field systems across the world
Work with teams including software and hardware engineering, processor development, cryogenics, and customer support to elicit requirements, collect and store metrics, analyze trends, and provide dashboards and other tooling to enable observability across the organization
Own alerting jointly with other SREs, supporting infrastructure and on-call management systems and ensuring alerting is reliable and scalable
Work closely with the DevOps and Test Engineering teams to enable instrumenting builds and deploys to ensure reliability through every step of the software development lifecycle

Benefits

Company ownership
Competitive pay
Range of meaningful benefits

What This Job Offers

Job Type: Full-time
Career Level: Senior
Education Level: Bachelor's degree
Number of Employees: 101-250 employees
