Senior Site Reliability Engineer (SRE) - AWS & GCP

DataArt • Belgrade, MT

29d

About The Position

Our client is revolutionizing the retail direct store delivery model by addressing key challenges such as communication gaps, out-of-stocks, invoicing errors, and price inconsistencies. Through innovative technology and strong partnerships, they help boost sales, increase profits, and enhance customer loyalty.

We are seeking a skilled Mid-level to Senior Site Reliability Engineer (SRE) with hands-on experience in both AWS and Google Cloud Platform (GCP) to join a fast-paced, innovative project team. The role involves proactive monitoring, automation, and optimization of cloud infrastructure to ensure high availability, scalability, and security of mission-critical retail solutions. Candidates should be available for at least four hours of overlap with the New York time zone to ensure smooth collaboration and participation in team activities.

Requirements

4+ years of experience as an SRE, DevOps, or Platform Engineer.
Strong expertise in both AWS and GCP.
Experience with container orchestration: Kubernetes (EKS or GKE) or Amazon ECS.
Infrastructure automation with Terraform / CloudFormation and scripting (Python, Bash, Go).
Hands-on experience with monitoring tools (CloudWatch, Datadog, Prometheus, Grafana, ELK/EFK).
Knowledge of cloud networking, VPNs, and identity services (Active Directory, SSO).
Solid understanding of cloud security and DevSecOps practices.
Comfortable working in agile, collaborative team environments.
Excellent communication skills and ability to work with distributed teams.
Availability for a minimum of 4 hours of overlap with the New York time zone for meetings and collaboration.

Responsibilities

Design, build, and operate scalable and reliable systems on AWS and GCP.
Develop and maintain automation scripts to improve deployment, monitoring, and incident response.
Ensure that system availability, latency, and overall reliability meet service level objectives (SLOs).
Collaborate with development and operations teams to implement best practices for security, monitoring, and infrastructure management.
Proactively troubleshoot and resolve infrastructure incidents and performance bottlenecks.
Participate in on-call rotations and incident management processes.
Continuously improve system architecture and automation to reduce manual intervention and improve efficiency.
Support CI/CD pipelines and infrastructure as code (IaC) initiatives.