Site Reliability Engineer II

Renishaw•Alpharetta, GA

20h•$71,600 - $119,400

About The Position

LexisNexis Risk Solutions is a key partner in risk assessment, offering solutions within its Business Services vertical to help businesses grow revenue, improve operational efficiency, and improve customer experience. Its solutions address critical issues in Anti-Money Laundering/Counter Terrorist Financing, Identity Authentication & Verification, Fraud and Credit Risk Mitigation, and Customer Data Management. This role focuses on improving the reliability and usability of a core internal platform, with the aim of reducing operational burden, enabling partner teams to operate with greater confidence, and improving the long-term health of the Kubernetes ecosystem. It is an ideal fit for individuals who enjoy tackling complex reliability challenges, simplifying intricate systems, and supporting others on a shared platform.

Requirements

Experience operating Kubernetes in production, ideally on Azure Kubernetes Service (AKS)
Practical experience across core SRE practices such as monitoring, alerting, incident response, capacity planning, and automation
Good understanding of distributed systems behavior, failure modes, and dependency management
Experience automating infrastructure and operations using tools such as Terraform, Helm, and GitHub Actions
Experience with at least one programming or scripting language used for automation and tooling (for example, Python or Bash)
Experience designing systems that favor reliability, simplicity, and clear ownership over ad hoc fixes
Comfort participating in on-call rotations and leading or supporting incidents in a calm, structured way
Ability to influence without authority and work effectively with multiple partner teams
A mindset oriented toward root cause analysis, long-term fixes, and continuous improvement

Nice To Haves

Familiarity with service meshes, ingress patterns, and zero trust networking concepts
Experience with cloud cost optimization in Kubernetes environments
Prior exposure to internal platform or enablement teams

Responsibilities

Own reliability and resilience outcomes for an internal AKS fleet used by multiple partner teams
Design, implement, and improve Kubernetes platform capabilities such as cluster lifecycle management, workload isolation, autoscaling, and safe multi-tenancy
Lead and execute toil reduction initiatives through automation, self-service workflows, and strong platform defaults
Build and evolve observability across metrics, logs, and traces, with a focus on distributed system dependencies and actionable signals
Improve incident response by automating detection, recovery, and mitigation to protect service levels
Participate in an on-call rotation, act as an incident responder, and support others during high-impact events
Contribute to SRE processes such as incident reviews, error budgets, and reliability planning, drawing on practical experience
Provide informal mentorship and technical guidance to junior SREs and engineers on partner teams
Collaborate with security, networking, and application teams to align platform standards and reduce cross-team friction
Continuously identify opportunities to simplify architecture, reduce operational overhead, and optimize cloud cost