Senior SRE

Renishaw • Alpharetta, GA

Posted 25 days ago

About The Position

About the Business

LexisNexis Risk Solutions is the essential partner in the assessment of risk. Within our Business Services vertical, we offer a multitude of solutions focused on helping businesses of all sizes drive higher revenue growth, maximize operational efficiency, and improve customer experience. Our solutions help customers solve difficult problems in Anti-Money Laundering/Counter-Terrorist Financing, Identity Authentication & Verification, Fraud and Credit Risk mitigation, and Customer Data Management. You can learn more about LexisNexis Risk at https://risk.lexisnexis.com.

About This Role

This role directly shapes the reliability and usability of a core internal platform. Your work will reduce operational burden across the organization, enable partner teams to move faster with confidence, and improve the long-term health of our Kubernetes ecosystem. If you enjoy solving hard reliability problems, simplifying complex systems, and helping others succeed on a shared platform, this role is a strong fit.

Requirements

Strong hands-on experience operating Kubernetes in production, ideally with Azure Kubernetes Service (AKS)
Practical experience across core SRE practices such as monitoring, alerting, incident response, capacity planning, and automation
Solid understanding of distributed systems behavior, failure modes, and dependency management
Experience automating infrastructure and operations using tools such as Terraform, Helm, and GitHub Actions
Proficiency with at least one programming or scripting language used for automation and tooling, such as Python or Bash
Experience designing systems that favor reliability, simplicity, and clear ownership over ad hoc fixes
Comfort participating in on-call rotations and leading or supporting incidents in a calm, structured way
Ability to influence without authority and work effectively with multiple partner teams
A mindset oriented toward root cause analysis, long-term fixes, and continuous improvement

Nice To Haves

Familiarity with service meshes, ingress patterns, and zero-trust networking concepts
Experience with cloud cost optimization in Kubernetes environments
Prior exposure to internal platform or enablement teams

Responsibilities

Own reliability and resilience outcomes for an internal AKS fleet used by multiple partner teams
Design, implement, and improve Kubernetes platform capabilities such as cluster lifecycle management, workload isolation, autoscaling, and safe multi-tenancy
Lead and execute toil reduction initiatives through automation, self service workflows, and strong platform defaults
Build and evolve observability across metrics, logs, and traces, with a focus on distributed system dependencies and actionable signals
Improve incident response by automating detection, recovery, and mitigation to protect service levels
Participate in an on-call rotation, act as an incident responder, and support others during high-impact events
Contribute to SRE processes such as incident reviews, error budgets, and reliability planning, drawing on practical experience
Provide informal mentorship and technical guidance to junior SREs and engineers on partner teams
Collaborate with security, networking, and application teams to align platform standards and reduce cross-team friction
Continuously identify opportunities to simplify architecture, reduce operational overhead, and optimize cloud cost