About The Position

We're hiring a Staff Database Reliability Engineer to own the strategy, architecture, and operational excellence of our data infrastructure. This is an expert-level IC role with deep influence on engineering direction, partnering closely with platform, backend, and DevOps engineers. You will own the data tier end-to-end: design schemas and access patterns that scale, tune Aurora for latency and throughput, and set the standards for how engineers interact with our databases. When a migration script seizes up mid-deploy and writes start queueing behind an ACCESS EXCLUSIVE lock, your runbooks and automation resolve the incident quickly.

Make the Django ORM a strength, not a liability:

  • Review migrations for safety at scale — locks, backfills, concurrent index builds, NOT VALID constraints
  • Catch N+1 patterns and missing select_related/prefetch_related in review
  • Establish conventions for QuerySet usage and physical schema design (indexes, constraints, partitioning)
  • Scale review through automation, not heroics — author AGENTS.md files and DNA scaffolding that encode our conventions, configure AI review bots (Claude Code, Cursor, etc.) to flag risky migrations and ORM anti-patterns, and iterate on those configs as new failure modes emerge

Lead major infrastructure initiatives:

  • Capacity planning as traffic and engineering throughput grow
  • Zero-downtime schema migrations and cutovers
  • Multi-AZ resilience within a single region — Aurora writer/reader placement, failover behavior and RTO/RPO, ElastiCache and OpenSearch AZ topology, RabbitMQ survivability across AZs
  • Backups, PITR, failover testing, retention

Own the CDC pipeline (Aurora → DMS → S3 Parquet → Snowflake):

  • DMS task design and tuning, replication slot hygiene on the Postgres side
  • Schema evolution as Django migrations roll through — so a column rename doesn't silently break the warehouse at 6 AM
  • Parquet layout and partitioning, reliability of the Snowflake handoff
  • Automated checks that flag migrations likely to break downstream consumers

Drive observability across three complementary tools:

  • pganalyze — query-level performance, index advisor, schema insights; the go-to for "why is this ORM query slow"
  • CloudWatch — infrastructure metrics and alarms for Aurora, OpenSearch, ElastiCache, SQS, DMS
  • Honeycomb — high-cardinality tracing that ties slow DB calls back to users, flags, deploys, and flows
  • Shape how the three fit together, including Django-side instrumentation and trace attributes on ORM queries

Build tooling and guardrails:

  • Migration review automation and CI checks for risky patterns
  • Slow query pipelines fed from pganalyze
  • Self-service dashboards so teams understand their own query footprint

Support and evolve the rest of the stack:

  • OpenSearch — index design, sharding, mapping changes, reindexing strategy, Django-side indexing pipelines
  • Redis — caching patterns, eviction, sizing, Django cache framework, Celery/RQ usage, avoiding hot keys and thundering herds
  • SQS + RabbitMQ — queue design, DLQs, visibility timeouts, exchange/queue topology, AZ mirroring, consumer backpressure, Celery behavior under load
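To make the "migration review automation" idea concrete, here is a minimal sketch of the kind of CI check described above. It is hypothetical (the function name, patterns, and messages are ours, not from this posting) and covers just two classic PostgreSQL hazards: a CREATE INDEX without CONCURRENTLY, which blocks writes for the duration of the build, and an ADD CONSTRAINT without NOT VALID, which validates the whole table under a heavy lock.

```python
import re

# Hypothetical sketch of a CI lint for lock-heavy DDL in migration SQL.
# Each entry pairs a regex with the warning it should raise.
RISKY_PATTERNS = [
    (re.compile(r"\bCREATE\s+(?:UNIQUE\s+)?INDEX\b(?!\s+CONCURRENTLY)", re.I),
     "CREATE INDEX without CONCURRENTLY blocks writes"),
    (re.compile(r"\bADD\s+CONSTRAINT\b(?!.*\bNOT\s+VALID\b)", re.I),
     "ADD CONSTRAINT without NOT VALID scans the whole table under lock"),
]

def flag_risky_ddl(sql: str) -> list[str]:
    """Return one warning per risky statement found in the SQL."""
    warnings = []
    for statement in sql.split(";"):
        for pattern, message in RISKY_PATTERNS:
            if pattern.search(statement):
                warnings.append(message)
    return warnings
```

A real version would parse statements properly and know version-specific exceptions (e.g. fast ADD COLUMN ... DEFAULT on modern Postgres), but the shape — patterns in, warnings out, wired into CI — is the point.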

Requirements

  • Deep PostgreSQL expertise (EXPLAIN (ANALYZE, BUFFERS), MVCC, bloat, lock contention, vacuum/autovacuum).
  • Strong ORM fluency (Django, SQLAlchemy, ActiveRecord, or similar) including predicting SQL, spotting N+1 problems, and controlling eager loading.
  • Practical understanding of single-region multi-AZ design.
  • Production CDC experience, ideally AWS DMS, including logical replication, slot hygiene, schema evolution, and Parquet-based data lakes feeding Snowflake.
  • Hands-on experience with pganalyze (or Datadog DBM / Performance Insights / pg_stat_statements pipelines), CloudWatch (custom metrics, composite alarms, log insights), and Honeycomb (or another high-cardinality tracing tool).
  • Comfort with OpenTelemetry.
  • Real experience making AI coding and review tools useful for a team, including writing AGENTS.md files and configuring review agents.
  • OpenSearch at scale experience (sizing, sharding, JVM tuning, rolling upgrades, snapshots).
  • Production Redis experience (persistence tradeoffs, cluster mode, hot keys, thundering herds).
  • Experience with at least one production message broker (SQS, RabbitMQ, Kafka), including delivery semantics, idempotency, and failure modes.
  • Strong automation and IaC background with real code (Python, Go, or similar) and Terraform.
  • Comfortable in a high-growth environment.
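The N+1 requirement above is easiest to see with a schematic. This is plain Python, not Django — the "database" is a dict and each fetch call increments a query counter — but it shows why a per-row lookup loop issues N+1 queries while the batched fetch that select_related/prefetch_related achieve issues 2.

```python
# Schematic N+1 illustration: dicts stand in for tables, and every
# fetch_* call counts as one database query.
orders = {1: {"customer_id": 10}, 2: {"customer_id": 11}, 3: {"customer_id": 10}}
customers = {10: {"name": "Ada"}, 11: {"name": "Grace"}}
queries = 0

def fetch_orders():
    global queries
    queries += 1                      # one query for the order list
    return list(orders.values())

def fetch_customer(cid):
    global queries
    queries += 1                      # one query per customer lookup
    return customers[cid]

def fetch_customers(cids):
    global queries
    queries += 1                      # one batched query (an IN (...) list)
    return {cid: customers[cid] for cid in cids}

# N+1 shape: 1 query for the orders, then 1 more per order.
queries = 0
names = [fetch_customer(o["customer_id"])["name"] for o in fetch_orders()]
n_plus_one_count = queries            # 4 queries for 3 orders

# Batched shape: 2 queries total, regardless of how many orders there are.
queries = 0
rows = fetch_orders()
by_id = fetch_customers({o["customer_id"] for o in rows})
names = [by_id[o["customer_id"]]["name"] for o in rows]
batched_count = queries
```

The N+1 count grows linearly with the result set; the batched count stays constant — which is exactly the difference reviewers are expected to catch.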

Nice To Haves

  • Aurora Serverless v2 / Limitless experience (storage model, reader/writer split, ACU scaling).
  • Strong opinions about what makes a trace useful.
  • Experience versioning and iterating on AI prompts and configs.

Responsibilities

  • Own the data tier end-to-end.
  • Design schemas and access patterns that scale.
  • Tune Aurora for latency and throughput.
  • Set the standards for how engineers interact with our databases.
  • Resolve incidents quickly using runbooks and automation when migration scripts seize up mid-deploy and writes start queueing.
  • Review migrations for safety at scale (locks, backfills, concurrent index builds, NOT VALID constraints).
  • Catch N+1 patterns and missing select_related/prefetch_related in review.
  • Establish conventions for QuerySet usage and physical schema design (indexes, constraints, partitioning).
  • Scale review through automation by authoring AGENTS.md files and DNA scaffolding, configuring AI review bots to flag risky migrations and ORM anti-patterns, and iterating on those configurations.
  • Lead major infrastructure initiatives including capacity planning, zero-downtime schema migrations and cutovers, multi-AZ resilience, and backups/PITR/failover testing.
  • Own the CDC pipeline (Aurora → DMS → S3 Parquet → Snowflake), including DMS task design and tuning, replication slot hygiene, schema evolution, Parquet layout and partitioning, and Snowflake handoff reliability.
  • Develop automated checks that flag migrations likely to break downstream consumers.
  • Drive observability across pganalyze, CloudWatch, and Honeycomb, including Django-side instrumentation and trace attributes on ORM queries.
  • Build tooling and guardrails such as migration review automation, CI checks for risky patterns, slow query pipelines, and self-service dashboards.
  • Support and evolve the OpenSearch, Redis, and SQS + RabbitMQ stacks.
  • Lead cross-team initiatives.
  • Write design docs.
  • Influence without authority.
  • Be pragmatic during incidents, focused on preventing the next one.
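The CDC responsibility above — keeping a column rename from silently breaking the warehouse — also lends itself to a small automated check. The sketch below is hypothetical (names and shape are ours): diff the source table's columns after a migration against the columns the warehouse load expects, so drift shows up in CI rather than as a failed Snowflake load the next morning.

```python
# Hypothetical schema-drift check for a CDC pipeline: compare the
# post-migration source columns against what the warehouse ingests.
def schema_drift(source_columns: set[str], warehouse_columns: set[str]) -> dict:
    """Report columns the warehouse expects but the source dropped or
    renamed (breaking), and new source columns not yet ingested (benign)."""
    return {
        "breaks_warehouse": sorted(warehouse_columns - source_columns),
        "not_yet_ingested": sorted(source_columns - warehouse_columns),
    }

# A rename (email -> email_address) shows up on both sides of the diff:
drift = schema_drift({"id", "email_address"}, {"id", "email"})
```

Anything in `breaks_warehouse` would fail the migration's CI run; `not_yet_ingested` is informational.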

Benefits

  • Some of the nicest and smartest teammates you’ll ever work with
  • Competitive salaries
  • Comprehensive healthcare benefits
  • Exciting and motivating equity
  • Flexible PTO
  • 401k
  • Parental Leave
  • Commuter Benefits (SF office employees)
  • WFH Stipend