Senior Site Reliability Engineer

Remitly•Philadelphia, CT

1d•$104,900 - $189,051

About The Position

We are looking to immediately hire a highly skilled and proactive Senior SRE to join our dynamic team. You will combine software thinking and service operations to enable and run Elsevier’s large-scale, 24x7, distributed and fault-tolerant systems within agreed reliability objectives, whilst enabling the fast flow of feature and service updates. The successful candidate will possess deep expertise in cloud-native architectures, along with strong automation skills. This diverse team of engineers assists multiple product teams as we continue to innovate across all of our products within our global AWS cloud landscape.

Requirements

Extensive experience deploying, managing, and troubleshooting containerised applications.
Deep understanding of Kubernetes architecture, networking, security, storage, and operational best practices.
Proven expertise with AWS services and architectural principles.
Extensive knowledge of AWS security, compliance, and best practices.
Advanced skills in writing modular, reusable IaC components.
Strong Python scripting skills for automation, tooling, and data processing.
Ability to develop custom solutions for monitoring, automation, and incident management.
Experience designing and maintaining CI/CD workflows using GitHub Actions.
Current experience automating deployment pipelines, testing, and validation processes.
Familiarity with monitoring tools such as New Relic.
Knowledge of security best practices, network policies, and enterprise-grade RBAC policies.

Responsibilities

Designing, deploying, and maintaining highly available, scalable Kubernetes clusters on AWS EKS as well as the supporting ecosystem.
Managing and optimizing cross-portfolio cloud infrastructure, leveraging AWS services and supported organizational tooling.
Developing and maintaining Infrastructure as Code (IaC) solutions to automate provisioning and management of cloud and Kubernetes resources.
Writing automation processes to streamline operational workflows, incident response, and infrastructure management.
Implementing CI/CD pipelines to facilitate deployments, testing, and validation.
Supporting multi-regional critical infrastructure, ensuring high availability and rapid incident resolution.
Monitoring system health, instrumenting system components, troubleshooting issues, and performing root cause analysis.
Managing and supporting a complex cross-portfolio environment, coordinating across teams to ensure consistency and reliability.
Maintaining comprehensive documentation and best practice guides for solutions, ensuring users have clear instructions and support to effectively implement and operate their systems.
Mentoring junior team members and promoting best practices in SRE, automation, and cloud architecture.