API Reliability Engineer

Empower

About The Position

Our vision for the future is based on the idea that transforming financial lives starts by giving our people the freedom to transform their own. We have a flexible work environment, and fluid career paths. We not only encourage but celebrate internal mobility. We also recognize the importance of purpose, well-being, and work-life balance. Within Empower and our communities, we work hard to create a welcoming and inclusive environment, and our associates dedicate thousands of hours to volunteering for causes that matter most to them. Chart your own path and grow your career while helping more customers achieve financial freedom. Empower Yourself.

Applicants must be authorized to work for any employer in the U.S. We are unable to sponsor or take over sponsorship of an employment visa at this time, including CPT/OPT.

We are seeking an API Reliability Engineer with strong backend engineering experience to build and operate reliable, scalable, and high-performance API services using Java and Spring Boot. This role requires hands-on experience developing and running APIs in production environments, along with the ability to troubleshoot complex issues and improve system resilience. You will work closely with API developers and platform teams to ensure services are designed and implemented with reliability, observability, and scalability built in, and play a key role in both incident resolution and prevention.

Requirements

Minimum 5 years of experience in backend or API development.
Strong hands-on experience with Java and Spring Boot.
Proven experience building, shipping, and operating APIs in production environments.
Strong problem-solving skills with the ability to debug real production issues end-to-end.
Experience handling P1/P2 incidents in production environments.
Solid understanding of API architecture, request lifecycle, and common failure patterns.
Experience with AWS services such as API Gateway, ALB/NLB, EC2, ECS/EKS, Lambda, RDS, and DynamoDB.
Familiarity with reliability patterns such as timeouts, retries, circuit breakers, and connection pooling.
Experience with observability tools such as Datadog and/or Splunk.
Experience with CI/CD pipelines, preferably Jenkins.
Strong debugging skills in distributed systems.
Experience with Git-based workflows and Agile development.
Bachelor’s degree in Computer Science, Information Systems, or a related field; equivalent practical experience welcomed.

Nice To Haves

AWS certifications such as Solutions Architect or Developer Associate.
Experience with microservices and distributed system design.
Exposure to SLAs/SLOs and service health metrics.
Experience with Docker and Kubernetes.
Familiarity with API gateways, traffic routing, and load balancing strategies.
Experience in performance tuning and scalability improvements.
Strong communication skills during high-severity incidents.

Responsibilities

Own and improve the reliability, performance, and scalability of API services in production.
Troubleshoot and resolve P1/P2 production incidents end-to-end, analyzing issues across application, infrastructure, and integrations.
Work closely with API developers to identify and address reliability issues and application-level security vulnerabilities in service design and implementation.
Contribute targeted code-level or configuration fixes to resolve issues and prevent recurrence.
Participate in root cause analysis (RCA) and drive durable, long-term fixes.
Improve API resilience through patterns such as timeouts, retries, circuit breakers, and graceful degradation.
Establish and enhance observability and service health signals, including logs, metrics, traces, and SLOs, using Datadog and Splunk.
Define and monitor SLAs/SLOs for API performance and availability.
Work with API Gateway and ALB/NLB for traffic management, routing, and system reliability.
Contribute to CI/CD pipelines using Jenkins to ensure safe and consistent deployments.
Contribute to disaster recovery readiness and system resilience planning.
Collaborate across engineering teams to improve system design and operational readiness.
Participate in an on-call rotation for critical incidents (P1/P2).

Benefits

Medical, dental, vision and life insurance
Retirement savings – 401(k) plan with generous company matching contributions (up to 6%), financial advisory services, potential company discretionary contribution, and a broad investment lineup
Tuition reimbursement up to $5,250/year
Business-casual environment that includes the option to wear jeans
Generous paid time off upon hire – including a paid time off program plus ten paid company holidays and three floating holidays each calendar year
Paid volunteer time – 16 hours per calendar year
Leave of absence programs – including paid parental leave, paid short- and long-term disability, and Family and Medical Leave (FMLA)
Business Resource Groups (BRGs) – BRGs facilitate inclusion and collaboration across our business internally and throughout the communities where we live, work and play. BRGs are open to all.
