Data Reliability Engineer

Empower
Overland Park, KS
Hybrid

About The Position

Our vision for the future is based on the idea that transforming financial lives starts by giving our people the freedom to transform their own. We have a flexible work environment and fluid career paths. We not only encourage but celebrate internal mobility. We also recognize the importance of purpose, well-being, and work-life balance. Within Empower and our communities, we work hard to create a welcoming and inclusive environment, and our associates dedicate thousands of hours to volunteering for causes that matter most to them. Chart your own path and grow your career while helping more customers achieve financial freedom. Empower Yourself.

Applicants must be authorized to work for any employer in the U.S. We are unable to sponsor or take over sponsorship of an employment visa at this time, including CPT/OPT.

We are looking for a hands-on Data Reliability Engineer to own the reliability, stability, and operational excellence of our AWS-based data platform. This role is focused on operating, troubleshooting, and improving production data systems, ensuring that data pipelines and analytics platforms are resilient, performant, and meet business-critical SLAs. You will work closely with data and platform engineering teams to diagnose issues, resolve production incidents, and influence better design and operational practices across the data ecosystem.

Requirements

  • Minimum 5 years of experience working with production data platforms in AWS environments
  • Prior experience building data pipelines and seeing them through production, including exposure to real-world failures and operational challenges
  • Strong experience with Python and SQL in real data systems
  • Hands-on experience troubleshooting distributed data processing systems (e.g., Spark/EMR, Redshift, streaming systems)
  • Proven ability to debug and resolve production issues in data pipelines and data platforms
  • Experience with AWS data services (such as EMR, Redshift, DynamoDB, S3, or similar)
  • Experience handling production incidents and performing root cause analysis
  • Strong problem-solving mindset and ability to work through ambiguous production issues

Nice To Haves

  • Experience handling real-world data issues such as pipeline delays or failures
  • Experience with backfills and reprocessing
  • Experience with late-arriving or incomplete data
  • Experience improving observability and alerting specifically for data systems
  • Experience influencing or guiding data pipeline reliability and operational practices
  • Exposure to streaming/event-driven systems (Kafka, Kinesis, CDC patterns)
  • Experience with disaster recovery, backup validation, and resiliency testing
  • Strong communication during incidents with both technical and non-technical stakeholders

Responsibilities

  • Own the reliability and stability of production data pipelines and data platform services
  • Diagnose and resolve data pipeline failures, delays, and data quality issues in production environments
  • Investigate issues across distributed data systems (e.g., Spark/EMR workloads, ingestion pipelines, warehouse performance)
  • Lead or support incident response, including triage, mitigation, and long-term resolution
  • Perform root cause analysis (RCA) and implement durable fixes to prevent recurrence
  • Define and improve data SLAs (freshness, latency, completeness) and ensure adherence
  • Design and enhance monitoring, alerting, and observability for data systems
  • Develop automation and tooling to reduce operational toil and improve system resilience
  • Contribute to disaster recovery (DR) and resiliency planning, including backup validation and recovery workflows
  • Partner with engineering teams to improve pipeline design, reliability, and operational readiness
  • Create and maintain runbooks, SOPs, and operational documentation
  • Participate in occasional off-hours support for production data systems when required

Benefits

  • Medical, dental, vision and life insurance
  • Retirement savings – 401(k) plan with generous company matching contributions (up to 6%), financial advisory services, potential company discretionary contribution, and a broad investment lineup
  • Tuition reimbursement up to $5,250/year
  • Business-casual environment that includes the option to wear jeans
  • Generous paid time off upon hire – including a paid time off program plus ten paid company holidays and three floating holidays each calendar year
  • Paid volunteer time – 16 hours per calendar year
  • Leave of absence programs – including paid parental leave, paid short- and long-term disability, and Family and Medical Leave (FMLA)
  • Business Resource Groups (BRGs) – BRGs facilitate inclusion and collaboration across our business internally and throughout the communities where we live, work and play. BRGs are open to all.