Site Reliability Engineer II

Renishaw
Alpharetta, GA
$71,600 - $119,400

About The Position

LexisNexis Risk Solutions is a key partner in risk assessment, offering solutions within its Business Services vertical to help businesses enhance revenue growth, optimize operational efficiencies, and improve customer experience. Its solutions address critical issues in Anti-Money Laundering/Counter Terrorist Financing, Identity Authentication & Verification, Fraud and Credit Risk mitigation, and Customer Data Management.

This role focuses on improving the reliability and usability of a core internal platform, with the aim of reducing operational burden, enabling partner teams to operate with greater confidence, and strengthening the long-term health of the Kubernetes ecosystem. It is an ideal fit for people who enjoy tackling complex reliability challenges, simplifying intricate systems, and supporting others on a shared platform.

Requirements

  • Experience operating Kubernetes in production, ideally Azure Kubernetes Service
  • Practical experience across core SRE practices such as monitoring, alerting, incident response, capacity planning, and automation
  • Good understanding of distributed systems behavior, failure modes, and dependency management
  • Experience automating infrastructure and operations using tools such as Terraform, Helm, and GitHub Actions
  • Experience with at least one programming or scripting language used for automation and tooling (for example, Python or Bash)
  • Experience designing systems that favor reliability, simplicity, and clear ownership over ad hoc fixes
  • Comfort participating in on-call rotations and leading or supporting incidents in a calm, structured way
  • Ability to influence without authority and work effectively with multiple partner teams
  • A mindset oriented toward root cause analysis, long-term fixes, and continuous improvement

Nice To Haves

  • Familiarity with service meshes, ingress patterns, and zero-trust networking concepts
  • Experience with cloud cost optimization in Kubernetes environments
  • Prior exposure to internal platform or enablement teams

Responsibilities

  • Own reliability and resilience outcomes for an internal AKS fleet used by multiple partner teams
  • Design, implement, and improve Kubernetes platform capabilities such as cluster lifecycle management, workload isolation, autoscaling, and safe multi-tenancy
  • Lead and execute toil reduction initiatives through automation, self-service workflows, and strong platform defaults
  • Build and evolve observability across metrics, logs, and traces, with a focus on distributed system dependencies and actionable signals
  • Improve incident response by automating detection, recovery, and mitigation to protect service levels
  • Participate in an on-call rotation, act as an incident responder, and support others during high-impact events
  • Contribute to SRE processes such as incident reviews, error budgets, and reliability planning, drawing on practical experience
  • Provide informal mentorship and technical guidance to junior SREs and engineers on partner teams
  • Collaborate with security, networking, and application teams to align platform standards and reduce cross-team friction
  • Continuously identify opportunities to simplify architecture, reduce operational overhead, and optimize cloud cost

Benefits

  • Annual incentive bonus