Lead Site Reliability Engineer - Remote

CentralSquare Technologies

Posted 9 days ago, Remote

About The Position

We are seeking a highly skilled Senior Cloud / DevOps Engineer with a strong background in AWS, automation, infrastructure as code, and networking to support and modernize our cloud environments. This hands-on role partners closely with Cloud Operations, SREs, Networking, and Application teams to improve scalability, reliability, security, and operational efficiency across mission-critical systems. The ideal candidate is comfortable operating at both the infrastructure and application layers, has strong troubleshooting skills, and can automate repeatable operational tasks while supporting high-availability production workloads.

Requirements

Strong background in AWS
Strong background in automation
Strong background in infrastructure as code
Strong background in networking
Comfortable operating at both the infrastructure and application layers
Strong troubleshooting skills
Ability to automate repeatable operational tasks
Experience supporting high-availability production workloads
Experience with Terraform, CloudFormation, or equivalent
Experience with CI/CD pipelines
Experience with Python, Bash, PowerShell, or similar scripting languages
Experience with cloud networking (VPCs, subnets, routing, VPNs, security groups, NACLs)
Experience with the AWS Well-Architected Framework

Nice To Haves

Track record of reducing manual operational work through automation
Track record of improving deployment reliability and production stability
Experience driving faster recovery and clearer root cause analysis during incidents
History of strong partnership with CloudOps, Networking, and Application teams

Responsibilities

Design, build, and maintain AWS-based infrastructure supporting production and non-production environments
Implement and maintain Infrastructure as Code (IaC) using tools such as Terraform, CloudFormation, or equivalent
Develop and support CI/CD pipelines for infrastructure and application deployments
Partner with application teams to improve deployment reliability and performance
Create and maintain automation scripts and tooling (Python, Bash, PowerShell, etc.) to reduce manual operations
Improve system reliability through self-healing mechanisms, monitoring, and alerting
Support SRE-style practices including incident response, root cause analysis, and continuous improvement
Design and support cloud networking (VPCs, subnets, routing, VPNs, security groups, NACLs)
Troubleshoot complex network, connectivity, and performance issues across hybrid environments
Implement security best practices aligned with the AWS Well-Architected Framework
Participate in on-call rotations supporting critical production systems
Provide operational support, troubleshooting, and resolution for cloud-related incidents
Collaborate across CloudOps, Networking, DBAs, and Application teams
Document architectures, runbooks, and operational procedures