About The Position

Our client is revolutionizing the retail direct store delivery model by addressing key challenges such as communication gaps, out-of-stocks, invoicing errors, and price inconsistencies. Through innovative technology and strong partnerships, they help boost sales, increase profits, and enhance customer loyalty.

We are seeking a skilled Mid-level to Senior Site Reliability Engineer (SRE) with hands-on experience in both AWS and Google Cloud Platform (GCP) to join a fast-paced, innovative project team. The role involves proactive monitoring, automation, and optimization of cloud infrastructure to ensure the high availability, scalability, and security of mission-critical retail solutions.

Candidates should be available for at least four hours of overlapping work time with the New York time zone to ensure smooth collaboration and participation in team activities.

Requirements

  • 4+ years of experience as an SRE, DevOps, or Platform Engineer.
  • Strong expertise in both AWS and GCP.
  • Experience with container orchestration: Kubernetes (EKS or GKE) or Amazon ECS.
  • Experience with infrastructure automation using Terraform or CloudFormation, and with scripting (Python, Bash, Go).
  • Hands-on experience with monitoring tools (CloudWatch, Datadog, Prometheus, Grafana, ELK/EFK).
  • Knowledge of cloud networking, VPNs, and identity services (Active Directory, SSO).
  • Solid understanding of cloud security and DevSecOps practices.
  • Comfortable working in agile, collaborative team environments.
  • Excellent communication skills and ability to work with distributed teams.
  • Availability for at least 4 hours of overlap with the New York time zone for meetings and collaboration.

Responsibilities

  • Design, build, and operate scalable and reliable systems on AWS and GCP.
  • Develop and maintain automation scripts to improve deployment, monitoring, and incident response.
  • Ensure that system availability, latency, and overall reliability meet service level objectives (SLOs).
  • Collaborate with development and operations teams to implement best practices for security, monitoring, and infrastructure management.
  • Troubleshoot and resolve infrastructure incidents and performance bottlenecks.
  • Participate in on-call rotations and incident management processes.
  • Continuously improve system architecture and automation to reduce manual intervention and improve efficiency.
  • Support CI/CD pipelines and infrastructure as code (IaC) initiatives.

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Industry: Publishing Industries
  • Education Level: No Education Listed
  • Number of Employees: 5,001-10,000 employees