Lead Site Reliability Engineer

Empower • Overland Park, KS

3d • Hybrid

About The Position

Our vision for the future is based on the idea that transforming financial lives starts by giving our people the freedom to transform their own. We have a flexible work environment and fluid career paths. We not only encourage but celebrate internal mobility. We also recognize the importance of purpose, well-being, and work-life balance. Within Empower and our communities, we work hard to create a welcoming and inclusive environment, and our associates dedicate thousands of hours to volunteering for causes that matter most to them. Chart your own path and grow your career while helping more customers achieve financial freedom. Empower Yourself.

Applicants must be authorized to work for any employer in the U.S. We are unable to sponsor or take over sponsorship of an employment visa at this time, including CPT/OPT.

The Lead Site Reliability Engineer will combine deep technical expertise with team leadership to drive reliability across Empower’s financial services platform. You will lead SREs in solving complex operational challenges, establish technical standards, and advise engineering leadership on infrastructure strategy and reliability initiatives.

Requirements

Bachelor’s degree in Computer Science, Information Technology, or related field (or equivalent practical experience).
7 to 10 years of Site Reliability Engineering experience (or equivalent), with demonstrated technical leadership.
Proven ability to lead technical teams and drive complex projects to completion.
Expert AWS knowledge, including designing large-scale, multi-region architectures.
Deep Kubernetes expertise, including advanced features, security, and production-scale operations.
Mastery of Infrastructure as Code using Terraform, including building shared platforms and frameworks.
Strong software engineering background with production experience in Python and/or Go.
Extensive experience with observability platforms (Datadog, Splunk) and implementing monitoring at scale.
Deep understanding of CI/CD principles and experience implementing enterprise-grade pipelines.
Proven track record leading major incidents and conducting effective postmortems.
Strong understanding of security, networking, and infrastructure design patterns.
Strong communication skills with ability to explain complex technical concepts to diverse audiences.
Experience mentoring engineers and building technical capabilities in teams.

Nice To Haves

Previous technical leadership roles (Lead, Staff, or similar) in SRE or Operational Excellence.
Financial services industry experience with understanding of regulatory requirements.
Expertise in compliance frameworks (SOC 2, PCI DSS, FINRA).
AWS certifications (Professional level).
Kubernetes certifications (CKA, CKAD, CKS).
Experience implementing SRE at organizations with 500+ engineers.
Background in chaos engineering, game days, and reliability testing practices.
Contributions to open-source projects with demonstrated community leadership.
Experience with service mesh implementation and management.
Track record of speaking at conferences or writing technical content.

Responsibilities

Lead cross-functional reliability initiatives across multiple value streams and coordinate execution across teams.
Define and evolve SRE best practices, tools, and methodologies across the organization.
Architect enterprise-scale, multi-region AWS infrastructure that balances reliability, cost, performance, and security.
Establish and operate SLOs, SLIs, and error budgets for critical services, using them to drive prioritization decisions.
Serve as incident commander for major incidents and drive postmortems that produce completed action items and organizational learning.
Lead disaster recovery planning for critical financial services infrastructure.
Build shared Infrastructure as Code foundations in Terraform (reusable modules, standards, and patterns adopted across teams).
Design and implement production-scale Kubernetes patterns, including multi-tenancy, security policies, and advanced scheduling.
Establish observability standards and strategies using Datadog and Splunk (metrics, logging, tracing, dashboards, and alerting).
Set CI/CD standards and patterns, including pipeline-as-code and progressive delivery at scale.
Lead chaos engineering, game days, and systematic reliability testing initiatives.
Drive FinOps initiatives to optimize cloud spend while maintaining reliability targets.
Lead a functional team of SREs (without direct reports) on projects and operational initiatives.
Mentor SREs at multiple levels through coaching, design reviews, code reviews, and training sessions.
Partner with Engineering, Product, and Security leadership to align reliability work with business priorities, zero-trust architecture, and compliance controls.

Benefits

Medical, dental, vision and life insurance
Retirement savings – 401(k) plan with generous company matching contributions (up to 6%), financial advisory services, potential company discretionary contribution, and a broad investment lineup
Tuition reimbursement up to $5,250/year
Business-casual environment that includes the option to wear jeans
Generous paid time off upon hire – including a paid time off program plus ten paid company holidays and three floating holidays each calendar year
Paid volunteer time – 16 hours per calendar year
Leave of absence programs – including paid parental leave, paid short- and long-term disability, and Family and Medical Leave (FMLA)
Business Resource Groups (BRGs) – BRGs facilitate inclusion and collaboration across our business internally and throughout the communities where we live, work and play. BRGs are open to all.
