Staff Database Reliability Engineer

Scribe

3d•Remote

About The Position

We're hiring a Staff Database Reliability Engineer to own the strategy, architecture, and operational excellence of our data infrastructure. This is an expert-level IC role with deep influence on engineering direction, partnering closely with platform, backend, and DevOps engineers. You will own the data tier end-to-end. Design schemas and access patterns that scale, tune Aurora for latency and throughput, and set the standards for how engineers interact with our databases. When a migration script seizes up mid-deploy and writes start queueing behind an ACCESS EXCLUSIVE lock, your runbooks and automation resolve the incident quickly. Make the Django ORM a strength, not a liability: Review migrations for safety at scale — locks, backfills, concurrent index builds, NOT VALID constraints Catch N+1 patterns and missing select_related/prefetch_related in review Establish conventions for QuerySet usage and physical schema design (indexes, constraints, partitioning) Scale review through automation, not heroics — author AGENTS.md files and DNA scaffolding that encode our conventions, configure AI review bots (Claude Code, Cursor, etc.) to flag risky migrations and ORM anti-patterns, and iterate on those configs as new failure modes emerge Lead major infrastructure initiatives: Capacity planning as traffic and engineering throughput grow Zero-downtime schema migrations and cutovers Multi-AZ resilience within a single region — Aurora writer/reader placement, failover behavior and RTO/RPO, ElastiCache and OpenSearch AZ topology, RabbitMQ survivability across AZs Backups, PITR, failover testing, retention Own the CDC pipeline (Aurora → DMS → S3 Parquet → Snowflake): DMS task design and tuning, replication slot hygiene on the Postgres side Schema evolution as Django migrations roll through — so a column rename doesn't silently break the warehouse at 6 AM Parquet layout and partitioning, reliability of the Snowflake handoff Automated checks that flag migrations likely to break downstream consumers Drive observability across three complementary tools: pganalyze — query-level performance, index advisor, schema insights - the go-to for "why is this ORM query slow" CloudWatch — infrastructure metrics and alarms for Aurora, OpenSearch, ElastiCache, SQS, DMS Honeycomb — high-cardinality tracing that ties slow DB calls back to users, flags, deploys, and flows Shape how the three fit together, including Django-side instrumentation and trace attributes on ORM queries Build tooling and guardrails: Migration review automation and CI checks for risky patterns Slow query pipelines fed from pganalyze Self-service dashboards so teams understand their own query footprint Support and evolve the rest of the stack: OpenSearch — index design, sharding, mapping changes, reindexing strategy, Django-side indexing pipelines Redis — caching patterns, eviction, sizing, Django cache framework, Celery/RQ usage, avoiding hot keys and thundering herds SQS + RabbitMQ — queue design, DLQs, visibility timeouts, exchange/queue topology, AZ mirroring, consumer backpressure, Celery behavior under load

Requirements

Deep PostgreSQL expertise (EXPLAIN (ANALYZE, BUFFERS), MVCC, bloat, lock contention, vacuum/autovacuum).
Strong ORM fluency (Django, SQLAlchemy, ActiveRecord, or similar) including predicting SQL, spotting N+1 problems, and controlling eager loading.
Practical understanding of single-region multi-AZ design.
Production CDC experience, ideally AWS DMS, including logical replication, slot hygiene, schema evolution, and Parquet-based data lakes feeding Snowflake.
Hands-on experience with pganalyze (or Datadog DBM / Performance Insights / pg_stat_statements pipelines), CloudWatch (custom metrics, composite alarms, log insights), and Honeycomb (or another high-cardinality tracing tool).
Comfort with OpenTelemetry.
Real experience making AI coding and review tools useful for a team, including writing AGENTS.md files and configuring review agents.
OpenSearch at scale experience (sizing, sharding, JVM tuning, rolling upgrades, snapshots).
Production Redis experience (persistence tradeoffs, cluster mode, hot keys, thundering herds).
Experience with at least one production message broker (SQS, RabbitMQ, Kafka), including delivery semantics, idempotency, and failure modes.
Strong automation and IaC background with real code (Python, Go, or similar) and Terraform.
Comfortable in a high-growth environment.

Nice To Haves

Aurora Serverless V2 / Limitless experience strongly preferred (storage model, reader/writer split, ACU scaling).
Opinionated about what makes a trace useful.
Versioning and iterating on AI prompts and configs.

Responsibilities

Own the data tier end-to-end.
Design schemas and access patterns that scale.
Tune Aurora for latency and throughput.
Set the standards for how engineers interact with our databases.
Resolve incidents quickly using runbooks and automation when migration scripts seize up mid-deploy and writes start queueing.
Review migrations for safety at scale (locks, backfills, concurrent index builds, NOT VALID constraints).
Catch N+1 patterns and missing select_related/prefetch_related in review.
Establish conventions for QuerySet usage and physical schema design (indexes, constraints, partitioning).
Scale review through automation by authoring AGENTS.md files and DNA scaffolding, configuring AI review bots to flag risky migrations and ORM anti-patterns, and iterating on those configurations.
Lead major infrastructure initiatives including capacity planning, zero-downtime schema migrations and cutovers, multi-AZ resilience, and backups/PITR/failover testing.
Own the CDC pipeline (Aurora → DMS → S3 Parquet → Snowflake), including DMS task design and tuning, replication slot hygiene, schema evolution, Parquet layout and partitioning, and Snowflake handoff reliability.
Develop automated checks that flag migrations likely to break downstream consumers.
Drive observability across pganalyze, CloudWatch, and Honeycomb, including Django-side instrumentation and trace attributes on ORM queries.
Build tooling and guardrails such as migration review automation, CI checks for risky patterns, slow query pipelines, and self-service dashboards.
Support and evolve the OpenSearch, Redis, and SQS + RabbitMQ stacks.
Lead cross-team initiatives.
Write design docs.
Influence without authority.
Be pragmatic during incidents, focused on preventing the next one.