Senior Site Reliability Engineer

Zello • Austin, TX

About The Position

Zello is a voice-first communication platform, powered by our industry-leading push-to-talk technology, to improve collaboration and productivity for deskless workers. With over 175 million users, we’re the #1 rated push-to-talk app in the world, delivering 9 billion (yes, with a B) messages a month. At Zello, our company values are at the heart of what we do every day. We’re proud to serve the frontline, we’re privileged to connect people in times of crisis across the globe, and we’re honored to support first responders.

And this is where you come in. We’re looking for a Site Reliability Engineer to help us make our systems more observable, performant, and resilient. You’ll work closely with our platform and application teams to build the tooling, practices, and insights that keep Zello reliable as we scale.

After a successful first year, you will have:

Implemented end-to-end observability tooling for application and infrastructure metrics, traces, and logs.
Delivered profiling and tracing systems that surface performance bottlenecks before they impact users.
Defined and tuned alerting to ensure only high-signal, actionable incidents reach engineers.
Helped evolve Zello’s incident response and postmortem processes, ensuring consistent learning and improvement.
Provided developers with clear visibility into application performance and release impact, driving data-informed engineering.

Requirements

BSc in Computer Science or equivalent experience.
6+ years of experience in site reliability, DevOps, or software engineering roles.
Deep understanding of monitoring, alerting, and observability platforms (e.g., Prometheus, Grafana, Loki, OpenTelemetry).
Experience implementing tracing, logging, and profiling for distributed systems.
Strong background in incident management, postmortem practices, and reliability metrics.
Familiarity with Linux, Kubernetes, Terraform, and GCP (preferred) or other major clouds.
Proficiency in a scripting or backend language (e.g., Python, Go, Bash).
Excellent problem-solving, communication, and collaboration skills.
Passionate about eliminating toil and driving continuous improvement in system health.

Responsibilities

Build and maintain monitoring, tracing, and profiling systems that empower teams to measure and improve performance.
Partner with cross-organization teams to define SLIs, SLOs, and SLAs that reflect real user experience.
Lead efforts to optimize observability, from instrumentation standards to dashboard design.
Participate in and help coordinate our on-call rotation, incident response, and post-incident reviews.
Continuously evaluate and recommend tools or process improvements to strengthen reliability and reduce alert fatigue.
Collaborate on platform improvements that enhance system resilience and developer velocity.

Benefits

competitive pay
equity with significant upside
benefits intentionally designed to encourage healthy, well-balanced employees, including flexible schedules and time off
sabbatical after every five years of service
ping-pong table and free snacks in our break room
