What you'll do:
Lead the design, implementation, and maintenance of highly available, scalable, and secure cloud-native infrastructure on Amazon Elastic Kubernetes Service (EKS).
Develop and implement comprehensive observability strategies, including monitoring, logging, and alerting, to ensure the health and performance of our systems.
Architect and optimize data pipelines to ensure efficient and reliable data flow across various platforms.
Drive the continuous improvement of our CI/CD pipelines, promoting best practices for automated testing, deployment, and release management.
Champion cloud-first strategies, leveraging the full capabilities of cloud platforms for infrastructure, services, and operations.
Implement and enforce robust security practices across our infrastructure, applications, and data.
Design and maintain Docker-based containerization solutions for our applications.
Develop and maintain automation scripts and tools using Python, Bash, and PowerShell.
Collaborate with development teams to ensure reliability is built into the software development lifecycle from inception.
Troubleshoot complex production issues across various layers of the stack, identifying root causes and implementing preventative measures.
Mentor and guide junior SREs, sharing knowledge and fostering a culture of operational excellence.
Participate in on-call rotations to support production systems.

What you need:
10+ years of experience in Site Reliability Engineering, DevOps, or a similar role with a strong focus on operational excellence.
Deep expertise in Amazon EKS, including cluster provisioning, management, and troubleshooting.
Extensive experience with observability tools and practices, including Prometheus, Grafana, ELK stack, or similar.
Proven track record in designing and implementing robust data pipelines (e.g., Kafka, Airflow, Spark).
Strong background in CI/CD methodologies and tools (e.g., Jenkins, GitLab CI, ArgoCD).
Expert-level knowledge of cloud platforms (AWS preferred), including infrastructure-as-code principles.
Comprehensive understanding of security best practices for cloud environments, applications, and data.
Proficiency in Docker for containerization and orchestration.
Advanced scripting and programming skills in Python, Bash, and PowerShell.
Solid understanding of networking concepts, distributed systems, and operating systems.
Excellent problem-solving, analytical, and communication skills.
Ability to work independently and as part of a highly collaborative team.
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.

Please note that this job description is intended to provide a general overview of the position and does not include an exhaustive list of responsibilities and qualifications.
Job Type
Full-time
Career Level
Mid Level
Industry
Transportation Equipment Manufacturing
Number of Employees
501-1,000 employees