DevOps Engineer

Zoom•San Jose, CA

1d•Hybrid

About The Position

We’re looking for a Senior Site Reliability / DevOps Engineer to help operate, scale, and continuously improve highly reliable SaaS production platforms running across large-scale, distributed environments. In this role, you’ll focus on operational excellence, automation, observability, and performance to ensure availability, reliability, and scalability for customer-facing services. You’ll own impactful initiatives end to end, lead incident response when it matters most, and partner closely with engineering teams to deliver durable improvements that directly impact customers.

You’ll join an experienced SRE/DevOps team passionate about reliability and automation. We foster collaboration, fair on-call rotations, and data-driven decisions. Together, we build resilient systems and efficient operations that power our SaaS platform.

Requirements

6–10 years of experience supporting and operating SaaS production systems in DevOps or SRE roles
Experience designing highly available systems with varying uptime targets (up to 99.999%), plus hands-on experience planning and executing multi-region disaster recovery (DR) strategies
Deep experience operating distributed systems in customer-facing production environments
Experience building and maintaining automation and operational tooling
Deep understanding of system reliability, availability, scalability, and performance optimization, with hands-on experience in monitoring, observability, and alerting platforms
Background in incident management, including on-call operations and root cause analysis
Ability to deeply understand the services you support, including their dependencies and failure modes
Proven track record of owning and delivering operational improvements end to end
Strong collaboration and communication skills

Responsibilities

Operating, scaling, and continuously improving SaaS production platforms across distributed environments
Designing and implementing zero-downtime solutions for highly available services (99.999%)
Developing and maintaining disaster recovery (DR) strategies across datacenters in multiple regions
Developing and maintaining automation, tooling, and scripts to improve deployment efficiency and reduce manual operations
Implementing and enhancing monitoring, alerting, and observability to proactively detect and prevent issues
Analyzing system behavior and performance data to identify bottlenecks and optimization opportunities
Owning system performance, availability, and scalability for customer-facing services
Leading incident response efforts, conducting root cause analysis, and implementing long-term remediation
Creating and maintaining runbooks and operational documentation to standardize procedures
Defining, tracking, and improving service reliability using SLOs, SLIs, and operational metrics
Providing operational input into platform and architecture decisions affecting SaaS services
Mentoring engineers and sharing operational best practices across teams
Participating in on-call rotations, incident management, and after-hours or weekend work for application releases and deployments

Benefits

As part of our award-winning workplace culture and commitment to delivering happiness, our benefits program offers a variety of perks, benefits, and options to help employees maintain their physical, mental, emotional, and financial health; support work-life balance; and contribute to their community in meaningful ways.
