Site Reliability Engineer

Zoom, San Jose, CA
Hybrid

About The Position

As a Site Reliability Engineer, you will have opportunities to work on our hybrid systems across the globe. You will be responsible for installing, configuring, and monitoring new systems within a network of global data centers. Additionally, you will patch and maintain thousands of physical and cloud systems worldwide. To streamline operations, you will develop automation to reduce repetitive tasks, and you will analyze and address performance bottlenecks. Furthermore, you will update and troubleshoot user access permissions, resolve network connectivity issues, and maintain system firewalls.

About the Team

Zoom's SRE team is committed to delivering customer happiness, improving business efficiency, and promoting agility through innovation, data-driven insights, and automation. Our impact is reflected in smooth user experiences, optimized processes, and support for Zoom's expansion in the realm of communication and collaboration.

Requirements

  • Have a Bachelor’s or Master’s degree in Computer Science or a related major
  • Demonstrate 2-5 years of hands-on experience in Site Reliability Engineering, DevOps, or Production Operations roles
  • Demonstrate proficiency in scripting languages including Python and Shell
  • Have experience in Linux systems administration with a focus on Ubuntu
  • Be able to participate in on-call shifts and incident management, and to work after hours or on weekends for infrastructure changes and deployments
  • Apply analytical and troubleshooting skills, with the ability to diagnose complex system issues
  • Have experience with CI/CD pipelines (e.g. Jenkins, GitLab CI) and version control systems (e.g. Git)
  • Have experience with build automation, configuration management tools (e.g. Ansible), and IaC provisioning tools (e.g. Packer/Terraform)
  • Have experience with bare metal infrastructure and datacenter operations, including proficiency in operating system deployment tools (e.g. Foreman, Cobbler, MAAS)
  • Have experience using Kubernetes or hold a Linux certification

Responsibilities

  • Install, configure, monitor, and maintain systems within a network of global data centers.
  • Develop automation scripts and tools using Python and Shell to streamline operations and reduce manual intervention.
  • Monitor and analyze system performance metrics to identify and address potential issues proactively.
  • Design, implement, and maintain CI/CD pipelines to enable rapid and reliable software deployments across multiple environments.
  • Monitor, troubleshoot, and optimize production systems to ensure uptime for critical Zoom infrastructure.
  • Collaborate with other teams to troubleshoot system performance issues and promote SRE best practices.
  • Participate in the on-call rotation to provide around-the-clock support for production incidents and system emergencies.