Senior Site Reliability Engineer (SRE)

Voya Financial • Atlanta, GA

About The Position

Together we fight for everyone’s opportunity for a better financial future. We will do this together, with customers, partners and colleagues. We will fight for others, not against: we will stand up for and champion everyone’s access to opportunities. The status quo is not good enough; we believe every individual and every community deserves access to financial opportunities. We are determined to support both individuals and communities in reaching a better financial future. We know that reaching this future depends on our actions today. Like our Purpose Statement, Voya believes in being bold and committed to action.

We are committed to a work environment where the differences that we are born with, and those we acquire throughout our lives, are understood, valued and intentionally pursued. We believe that our employees own our culture and have a responsibility to foster an environment where we all feel comfortable bringing our whole selves to work. Purposefully bringing our differences together to positively influence our culture, serve our clients and enrich our communities is essential to our vision.

Are you ready to join a company with a strong purpose and a winning culture? Start your Voyage.

We’re seeking a seasoned Site Reliability Engineer (SRE) who thrives at the intersection of software engineering, infrastructure, and AI systems. You’ll help ensure our platforms are scalable, reliable, and secure, while also contributing code, automation, and architectural improvements that support both traditional services and AI-driven workloads. This role is ideal for someone who thinks like a developer, understands AI infrastructure, and is passionate about reliability, observability, and operational excellence.

Requirements

5+ years of experience in SRE, DevOps, or software engineering roles.
Strong programming skills in languages such as Python and Java.
Experience supporting AI/ML workloads (e.g., model training, inference, GPU orchestration).
Deep understanding of Linux systems, cloud platforms (primarily Azure and AWS), and container orchestration.
Experience with infrastructure-as-code and related tooling (e.g., Terraform, Ansible, GitHub).
Proficiency in monitoring and logging tools such as Dynatrace.
Solid grasp of networking, security, and distributed systems.
Excellent communication and collaboration skills.

Nice To Haves

Experience with AI model observability, drift detection, or performance monitoring.
Contributions to open-source SRE, DevOps, or ML infrastructure tools.
Certifications in cloud platforms.

Responsibilities

Design, build, and maintain scalable infrastructure and automation tools for both traditional and AI-based systems.
Develop software solutions to improve system reliability and reduce manual toil.
Implement and manage CI/CD pipelines, including model deployment workflows.
Monitor system performance, availability, and security using modern observability tools.
Collaborate with data science and ML engineering teams to support AI/ML model training, serving, and lifecycle management.
Lead incident response, root cause analysis, and postmortem processes.
Advocate for SRE principles across engineering and AI teams.

Benefits

Health, dental, vision and life insurance plans
401(k) Savings plan – with generous company matching contributions (up to 6%)
Voya Retirement Plan – employer paid cash balance retirement plan (4%)
Tuition reimbursement up to $5,250/year
Paid time off – including 20 days paid time off, nine paid company holidays and a flexible Diversity Celebration Day.
Paid volunteer time – 40 hours per calendar year
