DevOps Engineer

Zoom
San Jose, CA
Hybrid

About The Position

We are hiring a Staff DevOps/Site Reliability Engineer to ensure reliability, scalability, and operational excellence for our real-time communications platform. This platform supports audio/video conferencing, recording, and live-streaming functionalities. The position requires expertise in infrastructure engineering, global team collaboration, and cross-functional partnerships. This team manages essential meeting service operations at Zoom. They handle global, large-scale distributed systems and advance communication technology to connect individuals across physical distances.

Requirements

  • Have 10+ years in DevOps, SRE, or infrastructure engineering roles, with at least 3 years at staff- or principal-level scope.
  • Have a proven track record of owning reliability for large-scale, distributed, latency-sensitive systems in production.
  • Have experience supporting real-time or media-heavy platforms (video conferencing, live streaming, gaming, trading systems, or similar).
  • Demonstrate ability to lead cross-functional technical initiatives without direct authority, driving alignment across engineering, product, and operations.
  • Have conceptual and architectural understanding of real-time communication protocols: WebRTC, RTP/RTCP, TURN/STUN, SDP, and SFU/MCU topologies.
  • Have solid expertise in cloud infrastructure (AWS, GCP, or Azure) and container orchestration (Kubernetes, Helm, ArgoCD).
  • Demonstrate proficiency with infrastructure-as-code tooling: Terraform, Pulumi, or equivalent.
  • Have experience with observability stacks: Prometheus, Grafana, Datadog, Jaeger, OpenTelemetry, or equivalent.
  • Have an understanding of networking fundamentals: BGP, anycast routing, DNS, load balancing, and CDN architecture.
  • Have experience using CI/CD tools such as GitHub Actions, Jenkins, and Spinnaker to streamline workflows and improve deployment processes.
  • Have experience implementing deployment safety practices such as canary releases, feature flags, and blue/green strategies to ensure reliable software delivery.
  • Demonstrate proficiency in Python, Bash, or Go for automation, tooling, and incident response; advanced software development expertise is not required.

Responsibilities

  • Owning reliability engineering and operations for real-time services, including the SLO/SLI framework: defining, tracking, and improving latency, availability, jitter, and packet loss.
  • Leading incident response for critical outages across the real-time platform, coordinating across time zones and engineering disciplines.
  • Promoting a blameless postmortem culture and ensuring action items lead to measurable reliability enhancements.
  • Implementing chaos engineering and game day exercises to proactively identify failure modes before user impact occurs.
  • Building and evolving observability tools — dashboards, alerting systems, and distributed tracing — tailored for real-time media infrastructure challenges.
  • Serving as the architectural authority on deployment patterns, infrastructure design, and operational readiness for real-time services.
  • Reviewing and contributing to system design proposals, providing feedback on scalability, fault tolerance, and operational complexity.
  • Driving capacity planning, traffic modeling, and cost optimization strategies across globally distributed infrastructure.
  • Evaluating and recommending infrastructure tools, platforms, and vendors — including media servers, CDN providers, cloud-native services, and edge networking.
  • Ensuring consistent standards for CI/CD pipelines, deployment safety, and progressive rollout strategies across teams.
  • Acting as the primary SRE partner for multiple engineering teams building real-time features, attending planning sessions, and providing operational readiness guidance.
  • Collaborating closely with network engineering, security, product, and data teams to align on platform-wide reliability requirements.
  • Translating infrastructure constraints and reliability trade-offs into actionable recommendations for product leaders and engineering teams.
  • Establishing and advocating DevOps best practices — infrastructure-as-code, GitOps, automated testing, and deployment automation — across partner teams.
  • Guiding senior engineers on SRE principles, reliability patterns, and operational discipline.
  • Serving as a technical liaison between US-based and China/India-based engineering teams, bridging communication gaps and providing technical context.
  • Conducting architecture reviews, incident retrospectives, and planning sessions in English and Mandarin as appropriate.
  • Maintaining a flexible schedule to ensure meaningful overlap with teams in Beijing, Shanghai, Bangalore, and Hyderabad.
  • Building collaborative relationships across cultural and geographic boundaries, adapting communication styles to foster trust and alignment.
  • Ensuring engineering documentation, runbooks, and architectural decision records are accessible and understandable for global team members.