Senior Database Administrator

Hard Rock Digital•United States, FL

2d•Hybrid

About The Position

Hard Rock Digital is building a team focused on becoming the best online sportsbook, casino, and social gaming company in the world. The company is looking for a Senior Database Administrator to own production reliability for its CockroachDB and PostgreSQL fleet. The role involves diagnosing and resolving issues in high-volume transaction systems, writing tooling to improve incident response, and setting operational standards. It also involves using AI coding agents to improve engineering workflows and keeping database transactions accurate and fast for millions of consumers.

Requirements

5+ years operating production relational databases as a DBA, Database Reliability Engineer, or Data Platform Engineer.
Deep production experience with PostgreSQL, CockroachDB, or another relational database management system, with fluency in the tradeoffs behind isolation, consensus, locality, and query planning.
Understanding of how a SQL engine works under the hood: MVCC, the Volcano iterator execution model, and cost-based optimizer frameworks such as Cascades. Ability to apply this knowledge to indexing strategy and execution-plan analysis.
Working knowledge of distributed-systems fundamentals: consensus (Raft and Paxos), distributed transactions, consistency models, and failure modes.
Ability to write tools and automation in Python or Go, and debug and suggest fixes to line-of-business code in Java or a similar language.
Experience with infrastructure-as-code and a monitoring and observability stack such as Grafana or Datadog.
UNIX/Linux administration with shell scripting, comfort on a cloud platform (AWS preferred; Azure or GCP welcome), and willingness to join an on-call rota.
Clear technical writing and speech.

Nice To Haves

Shipped Go or Python tooling for database operations, performance analysis, migration safety, or incident response that other engineers then used.
Investigated production incidents end-to-end using metrics, logs, query fingerprints, execution plans, and source-level reasoning.
Owned a database platform through on-call, post-incident repair, standards, and cross-team adoption.
Used AI coding agents in real engineering work, with review habits that keep generated output tied to evidence.
Run CockroachDB in production.
Contributed to an open-source database such as PostgreSQL.
Worked with streaming infrastructure such as Kafka and Materialize.
Worked with a modern data warehouse or analytics platform such as ClickHouse Cloud or Snowflake.
Practiced SRE along the lines of the Google SRE books.

Responsibilities

Operate and evolve multi-region CockroachDB clusters and PostgreSQL instances across production, staging, and development.
Investigate contention, serialization retries (SQLSTATE 40001), and range hotspots in the transaction paths behind bets, wagers, settlements, and payouts.
Diagnose leaseholder placement, monotonic-key write pressure, and cross-region latency that turn one query into several network round trips.
Trace a slow query to its execution plan, the plan to optimizer behavior, and the behavior to the schema or application call path that caused it.
Catch plan regressions after statistics changes, schema changes, data growth, or releases, then turn the fix into a safer rollout pattern.
Assess schema-change and backfill risk before it reaches large tables: lock behavior, retry pressure, capacity impact, and rollback path.
Plan capacity, scaling events, and version upgrades that customers don't feel.
Build database observability with Prometheus, Grafana, Mimir, Loki, and Snowflake: dashboards backed by real queries, alerting on leading indicators, and SLO tracking.
Connect database symptoms to application behavior and business events so alerts point at causes.
Join the on-call rota for the data layer, lead incident analysis when the database is part of the failure, and drive blameless postmortems toward durable fixes.
Instrument the system to surface degradation before it becomes an incident.
Build Go and Python tools that make incidents easier to understand: log collectors, explain-plan analyzers, migration checks, capacity models, and runbook generators.
Automate provisioning, configuration, backup, and recovery with Terraform and other infrastructure-as-code tools.
Work with platform engineering on CI/CD, deployment safety, and change management for database-touching services.
Use Claude Code, Codex, and similar harnesses in daily work to investigate incidents, generate probes, draft runbooks, and write tooling.
Keep the harness grounded in logs, traces, metrics, and source code, and verify its output against production facts before you ship it.
Evaluate AI-assisted observability and anomaly detection, and share what works with the team.
Implement and audit access controls, authentication, authorization, and encryption in transit and at rest.
Support data-protection, audit-logging, and retention requirements for a regulated gaming platform.
Build a database center of excellence: documentation, standards, and reusable patterns that other teams build against.
Mentor engineers on database fundamentals, distributed-systems behavior, and operational discipline.
Read CockroachDB and PostgreSQL internals when a problem needs a code-level answer, and bring what you learn back to the team.

Benefits

Competitive pay and benefits
Flexible vacation allowance
Employer-sponsored training and conference attendance
Opportunity to work in an AI-first environment with access to the tools you need to excel at your job
