Senior Site Reliability Engineer II

Remitly • Boca Raton, FL

51d • Hybrid

About The Position

LexisNexis Risk Solutions is seeking a hands-on Senior Site Reliability Engineer (SRE) to build, operate, and improve the reliability of its production systems. The role involves designing infrastructure, writing Terraform, enhancing observability, and responding to production incidents. The position can be fully remote for candidates who do not live near an office, or hybrid for those who do. The company emphasizes that applicants are not restricted by job site or posting location.

Requirements

5+ years of hands-on experience in SRE, DevOps, or Infrastructure Engineering roles
Strong production experience in AWS
Significant hands-on experience with Terraform in real-world environments
Experience operating monitoring and uptime platforms such as Grafana, Pingdom, and Uptrends
Strong Linux systems, networking, and troubleshooting skills
Experience supporting production systems through incident response and on-call rotations
Proficiency with GitHub and modern Git workflows
Experience building or maintaining CI/CD pipelines with Azure DevOps
Familiarity with ITSM and incident workflows using ServiceNow
Strong written communication skills with experience documenting systems and processes in Confluence
Ability to work independently in a remote or hybrid environment

Nice To Haves

Experience defining and operating against SLOs and error budgets
Infrastructure-as-Code best practices beyond basic Terraform usage (modules, testing, CI integration)
Experience with containers and orchestration (Docker, Kubernetes)
Experience supporting large-scale, high-availability production systems
Prior experience mentoring engineers or serving as a technical lead

Responsibilities

Design, build, and operate highly available, scalable systems in AWS
Write, maintain, and review Terraform to provision and manage infrastructure
Own and improve monitoring, alerting, and observability using Grafana, Pingdom, and Uptrends
Participate in a rotating on-call schedule, responding to production incidents and driving issues to resolution
Lead incident response, root cause analysis, and post-incident reviews with a focus on prevention and automation
Define and manage SLOs, SLIs, and error budgets
Build and improve CI/CD pipelines and operational workflows using Azure DevOps and GitHub
Work directly with application teams to improve reliability, performance, and deployability
Automate manual operational tasks to reduce toil
Maintain clear, actionable runbooks and documentation in Confluence
Track work, incidents, and operational improvements using Jira and ServiceNow
Mentor other engineers and help set SRE standards and best practices