About The Position

Zello is a voice-first communication platform, powered by our industry-leading push-to-talk technology, that improves collaboration and productivity for deskless workers. With more than 175 million users, we’re the #1-rated push-to-talk app in the world, delivering 9 billion (yes, with a B) messages a month.

At Zello, our company values are at the heart of what we do every day. We’re proud to serve the frontline, we’re privileged to connect people in times of crisis across the globe, and we’re honored to support first responders. And this is where you come in.

We’re looking for a Site Reliability Engineer to help us make our systems more observable, performant, and resilient. You’ll work closely with our platform and application teams to build the tooling, practices, and insights that keep Zello reliable as we scale.

After a successful first year, you will have:

  • Implemented end-to-end observability tooling for application and infrastructure metrics, traces, and logs.
  • Delivered profiling and tracing systems that surface performance bottlenecks before they impact users.
  • Defined and tuned alerting to ensure only high-signal, actionable incidents reach engineers.
  • Helped evolve Zello’s incident response and postmortem processes, ensuring consistent learning and improvement.
  • Provided developers with clear visibility into application performance and release impact, driving data-informed engineering.

Requirements

  • BSc in Computer Science or equivalent experience.
  • 6+ years of experience in site reliability, DevOps, or software engineering roles.
  • Deep understanding of monitoring, alerting, and observability platforms (e.g., Prometheus, Grafana, Loki, OpenTelemetry).
  • Experience implementing tracing, logging, and profiling for distributed systems.
  • Strong background in incident management, postmortem practices, and reliability metrics.
  • Familiarity with Linux, Kubernetes, Terraform, and GCP (preferred) or other major clouds.
  • Proficiency in a scripting or backend language (e.g., Python, Go, Bash).
  • Excellent problem-solving, communication, and collaboration skills.
  • Passion for eliminating toil and driving continuous improvement in system health.

Responsibilities

  • Build and maintain monitoring, tracing, and profiling systems that empower teams to measure and improve performance.
  • Partner with teams across the organization to define SLIs, SLOs, and SLAs that reflect real user experience.
  • Lead efforts to optimize observability, from instrumentation standards to dashboard design.
  • Participate in and help coordinate our on-call rotation, incident response, and post-incident reviews.
  • Continuously evaluate and recommend tools or process improvements to strengthen reliability and reduce alert fatigue.
  • Collaborate on platform improvements that enhance system resilience and developer velocity.

Benefits

  • competitive pay
  • equity with significant upside
  • benefits intentionally designed to encourage healthy, well-balanced employees, including flexible schedules and time off
  • sabbatical after every five years of service
  • ping-pong table and free snacks in our break room