About The Position

Unify is redefining go-to-market with state-of-the-art AI. As our Staff SRE Tech Lead, you'll own the reliability and scalability of our platform as we add terabytes of data monthly and onboard customers with demanding uptime requirements. You'll set the technical direction for reliability engineering, lead a pod of SREs, and partner directly with engineering leadership to build the systems and practices that keep Unify fast and reliable at scale.

Requirements

  • 8+ years of software engineering experience with a strong backend foundation, including 3+ years focused on reliability, infrastructure, or platform work.
  • Experience leading teams or pods—setting technical direction, mentoring engineers, and driving execution on complex projects.
  • Deep expertise operating databases at scale, including schema design, query optimization, replication, and failover strategies.
  • Strong programming skills (TypeScript, Python, Go, or similar) with a track record of building automation and tooling that meaningfully reduces operational burden.
  • Collaborative, low-ego attitude with a history of leveling up the people around you.

Responsibilities

  • Lead the SRE pod: Set technical direction, drive prioritization, and mentor engineers—ensuring the team is tackling the highest-leverage reliability and scalability challenges.
  • Scale our data infrastructure: Architect and extend our ClickHouse and PostgreSQL deployments to handle terabytes of new data monthly by designing partitioning strategies, tuning queries, and building resilient replication and failover systems.
  • Improve system performance: Profile and optimize critical paths across our backend services, identify bottlenecks in data pipelines and API layers, and ship changes that meaningfully improve latency and throughput.
  • Build for reliability: Design and implement rate limiting, circuit breakers, graceful degradation, and other patterns that keep the platform stable under load and during partial failures.
  • Automate everything: Drive tooling that eliminates toil—automating deployments, scaling operations, backup verification, and incident remediation.
  • Instrument and observe: Build out distributed tracing, metrics, and alerting that give engineers clear visibility into system behavior and make debugging production issues fast.
  • Define and enforce SLOs: Establish reliability targets aligned with customer needs, manage error budgets, and drive architectural decisions that balance shipping speed with system stability.