Senior Data Reliability Engineer

Zeta, Basking Ridge, NJ

About The Position

As a Senior Data Reliability Engineer, you will be responsible for architecting, scaling, and optimizing enterprise-grade data platforms, including large-scale data lakes and data warehouses built from multiple disparate data sources. This role requires deep expertise in cloud databases, data infrastructure reliability, observability, and automation, with a strong focus on operational excellence, performance, and resilience.

Requirements

  • Deep expertise in PostgreSQL administration and performance tuning, preferably in AWS RDS environments.
  • Strong experience with Debezium, Kafka Connect, ETL frameworks/tools, and enterprise-grade data pipeline architectures.
  • Strong hands-on experience with Amazon Redshift, S3, and cloud-native data platforms.
  • Expertise in Apache Airflow workflow orchestration and operational management.
  • Experience with Apache Spark and large-scale distributed data processing.
  • Strong scripting and automation experience using Python, Bash, or similar languages.
  • Strong experience in Infrastructure as Code (IaC) using Terraform, Crossplane, or equivalent tools.
  • Hands-on experience with monitoring and observability tools such as CloudWatch, Prometheus, Grafana.
  • Strong understanding of cloud database security, compliance, and governance frameworks (e.g., GDPR, HIPAA).
  • Experience designing highly available, fault-tolerant, and scalable cloud database systems.
  • Bachelor’s degree in Computer Science, Information Technology, or a related field (master’s preferred).
  • 10–12 years of overall experience in database engineering, cloud data infrastructure, or reliability engineering.
  • At least 5 years of hands-on experience with PostgreSQL, including AWS RDS administration.
  • Strong experience in cloud-native data platforms and enterprise-scale production environments.

Nice To Haves

  • AWS Certified Database - Specialty or other relevant cloud certifications.

Responsibilities

  • Own the reliability, availability, scalability, and performance of PostgreSQL RDS environments across production and non-production systems.
  • Lead proactive monitoring and observability initiatives for PostgreSQL RDS instances, leveraging tools such as CloudWatch, Prometheus, Grafana, and other enterprise monitoring platforms.
  • Drive advanced PostgreSQL performance tuning, including query optimization, indexing strategies, parameter tuning, and capacity planning.
  • Architect and optimize database backup, disaster recovery, and failover strategies to ensure business continuity and minimal downtime.
  • Own the reliability and operational excellence of Debezium and Kafka Connect ecosystems, ensuring robust real-time data ingestion and delivery.
  • Lead troubleshooting and optimization of ETL workflows and data pipelines, ensuring scalability, reliability, and fault tolerance across data platforms.
  • Oversee Apache Airflow workflow orchestration, ensuring high reliability, SLA adherence, and operational efficiency of production DAGs.
  • Design and implement Infrastructure as Code (IaC) solutions using tools such as Terraform, Crossplane, and automation frameworks to streamline deployments and operational tasks.
  • Lead incident response, root cause analysis, and post-incident reviews for critical production issues.
  • Define and enforce database security standards, including access controls, encryption policies, compliance adherence, and periodic security audits.
  • Partner closely with engineering, DevOps, and data platform teams to optimize data architecture and improve overall platform reliability.
  • Mentor junior engineers and drive best practices across database reliability engineering and cloud data operations.
  • Identify and lead continuous improvement initiatives focused on reliability, automation, scalability, and operational maturity.