DevOps Engineer

Zoom, San Jose, CA
Hybrid

About The Position

We’re looking for a Senior Site Reliability / DevOps Engineer to help operate, scale, and continuously improve highly reliable SaaS production platforms running across large-scale, distributed environments. In this role, you’ll focus on operational excellence, automation, observability, and performance to ensure availability, reliability, and scalability for customer-facing services. You’ll own impactful initiatives end to end, lead incident response when it matters most, and partner closely with engineering teams to deliver durable improvements that directly impact customers. You’ll join an experienced SRE/DevOps team passionate about reliability and automation. We foster collaboration, fair on-call rotations, and data-driven decisions. Together, we build resilient systems and efficient operations that power our SaaS platform.

Requirements

  • 6–10 years of experience supporting and operating SaaS production systems in DevOps or SRE roles
  • Experience designing highly available systems with dynamic uptime targets (up to 99.999%), plus hands-on experience planning and executing multi-region disaster recovery (DR) strategies
  • Deep experience operating distributed systems in customer-facing production environments
  • Experience building and maintaining automation and operational tooling
  • Deep understanding of system reliability, availability, scalability, and performance optimization, with hands-on experience in monitoring, observability, and alerting platforms
  • Background in incident management, including on-call operations and root cause analysis
  • Ability to deeply understand the services you support, including their dependencies and failure modes
  • Proven track record of owning and delivering operational improvements end to end
  • Strong collaboration and communication skills

Responsibilities

  • Operating, scaling, and continuously improving SaaS production platforms across distributed environments
  • Designing and implementing zero-downtime solutions for highly available services (99.999%)
  • Developing and maintaining disaster recovery (DR) strategies across datacenters in multiple regions
  • Developing and maintaining automation, tooling, and scripts to improve deployment efficiency and reduce manual operations
  • Implementing and enhancing monitoring, alerting, and observability to proactively detect and prevent issues
  • Analyzing system behavior and performance data to identify bottlenecks and optimization opportunities
  • Owning system performance, availability, and scalability for customer-facing services
  • Leading incident response efforts, conducting root cause analysis, and implementing long-term remediation
  • Creating and maintaining runbooks and operational documentation to standardize procedures
  • Defining, tracking, and improving service reliability using SLOs, SLIs, and operational metrics
  • Providing operational input into platform and architecture decisions affecting SaaS services
  • Mentoring engineers and sharing operational best practices across teams
  • Participating in on-call rotations, incident management, and after-hours or weekend work for application releases and deployments

Benefits

  • As part of our award-winning workplace culture and commitment to delivering happiness, our benefits program offers a variety of perks, benefits, and options to help employees maintain their physical, mental, emotional, and financial health; support work-life balance; and contribute to their community in meaningful ways.