Senior Site Reliability Engineer

Zello • Austin, TX

About The Position

Zello is seeking a Senior Site Reliability Engineer to own the reliability and high availability of its data tier (MySQL, MongoDB, ScyllaDB, Elasticsearch, Redis) while also contributing to the broader platform. The role involves monitoring, on-call duties, and cloud modernization efforts. The company is investing in AI to improve incident response, build agents and tooling for faster root-cause analysis, and enhance developer productivity. The ideal candidate has experience operating production systems, handling incidents, and documenting procedures. The role is part of the Platform team and reports to the Director of Platform Engineering.

Requirements

Operated highly available MySQL and MongoDB in production at scale, including replication, sharding, backups, point-in-time recovery, and failover drills.
Diagnoses database performance end-to-end, including query plans, indexes, locking, OS, storage, and network.
Shipped meaningful work on at least two of the following: bare-metal Linux, containerized workloads (Docker, Kubernetes, or similar), and a major cloud (GCP preferred; AWS or Azure equivalent is fine).
Instrumented systems with Prometheus, OpenTelemetry, or comparable tools to close real incidents, and built useful dashboards.
Writes production code in Python, Go, Bash, or similar for automation, tooling, or operators.
Communicates clearly under pressure and after the fact, with blameless, specific postmortems that lead to lasting changes.
Brings an opinion on managed vs. self-managed databases and can defend trade-offs based on availability, cost, and operational burden.
7+ years in SRE, DevOps, platform, infrastructure, or database reliability roles.
At least 3 years owning production databases.
BSc in Computer Science or equivalent practical experience.

Nice To Haves

ScyllaDB/Cassandra or Elasticsearch experience.
Experience using AI tooling (copilots, agents, or custom automation) to expedite incident response, root-cause analysis, or developer workflows.

Responsibilities

Design, deploy, and operate highly available MySQL and MongoDB clusters across cloud environments, including replication, sharding, backups, point-in-time recovery, upgrades, and disaster recovery.
Tune query performance, schema, and index strategy in partnership with application engineers and push fixes upstream into the application when appropriate.
Extend the observability stack (Prometheus, Loki, and Tempo) so that the data tier is well instrumented and traces lead to root causes.
Participate in the Platform on-call rotation, lead incident response for data-tier issues, and write postmortems that drive durable change.
Improve disaster recovery, security posture, and compliance for the database footprint, including encryption, access control, audit logging, and backup integrity.
Evaluate and operate ScyllaDB/Cassandra and Elasticsearch, and advise on their suitability for specific workloads.
Write automation, tooling, and operators to reduce repetitive work for the team.
Use AI to shorten incident response and root-cause analysis by building agents, automation, and developer-enablement tooling.

Benefits

Competitive pay
Equity with significant upside
Benefits intentionally designed to support healthy, well-balanced employees
Flexible schedules
Time off
Sabbatical after every five years of service
Ping-pong table
Free snacks in the break room
