Senior Site Reliability Engineer

Unify
San Francisco, CA

About The Position

Unify is redefining go-to-market with state-of-the-art AI. As a Senior SRE, you'll tackle the scaling and reliability challenges that come with adding terabytes of data monthly and supporting enterprise customers with demanding uptime requirements. You'll work across the stack—optimizing databases, hardening services, and building the automation and observability that keep Unify fast and reliable at scale.

Requirements

  • 5+ years of software engineering experience with a strong backend foundation, including 2+ years focused on reliability, infrastructure, or platform work.
  • Hands-on experience operating databases at scale, including query optimization, replication, and failover.
  • Strong programming skills (TypeScript, Python, Go, or similar) with experience building automation and tooling.
  • Ability to diagnose complex distributed systems issues under pressure and communicate clearly during incidents.
  • A collaborative, low-ego attitude and a desire to work in a fast-paced environment.

Responsibilities

  • Scale our data infrastructure: Optimize and extend our ClickHouse and PostgreSQL deployments—designing partitioning strategies, tuning queries, and improving replication and failover systems.
  • Improve system performance: Profile and optimize critical paths across backend services, identify bottlenecks in data pipelines and API layers, and ship changes that improve latency and throughput.
  • Build for reliability: Implement rate limiting, circuit breakers, graceful degradation, and other patterns that keep the platform stable under load and during partial failures.
  • Automate everything: Write tooling that eliminates toil—automating deployments, scaling operations, backup verification, and incident remediation.
  • Instrument and observe: Build out distributed tracing, metrics, and alerting that give engineers clear visibility into system behavior and accelerate debugging.
  • Respond and learn: Participate in on-call rotations, run incident response, and drive blameless postmortems that prevent recurrence.