Lead Site Reliability Engineer

McGraw Hill LLC

76d • $140,000 - $167,000 • Remote

About The Position

Make an Impact! At McGraw Hill, we create best-in-class, next-generation learning platforms that are used by millions of students and educators worldwide, from kindergarten through graduate school. Our goal is to accelerate student success through intuitive and effective learning tools and content that maximize a teacher’s time and a student’s learning experience. We do all of this in a supportive, collaborative environment where you can grow your career in a way that fits into your life.

How can you make an impact? We are hiring a Lead Site Reliability Engineer to build and support reliable, high-capacity, and high-performing core infrastructure services that enable us to reimagine learning for millions of students and educators worldwide. You will lead cross-functional teams to design, deploy, and manage foundational infrastructure services while driving initiatives to enhance system reliability, performance, and scalability. If you thrive on building developer tools, automating processes, solving cloud-related challenges, and mentoring engineers, this role is for you.

This is a remote position open to applicants authorized to work for any employer within Canada.

Requirements

Proven experience building and managing large-scale systems and tools in AWS using repeatable and maintainable methods.
Expertise in Kubernetes (EKS or self-managed clusters) and container orchestration technologies.
Proficiency in infrastructure automation tools like Terraform or CloudFormation.
Strong programming skills in Python, Golang, or Bash, with a focus on production software development.
Experience with CI/CD pipelines, GitOps tools (Argo CD, Flux), and observability platforms (New Relic, CloudWatch, Datadog).
Versatility in troubleshooting hosting technologies, including web servers, application platforms, operating systems, and network components.
Strong communication, problem-solving, and systems engineering skills.
A proactive mindset and ability to work across team boundaries daily.
A degree in Computer Science or equivalent industry experience.

Responsibilities

Collaborate with product development teams in a DevOps model to design, deploy, and manage automation tools that enhance predictability and accelerate time to market.
Optimize existing systems to ensure "right-sized" solutions that balance technical and business constraints.
Drive initiatives to improve system reliability and performance.
Ensure repeatability, traceability, and transparency of infrastructure automation using Infrastructure-as-Code (IaC).
Actively monitor AWS costs and use optimization tools to maximize ROI while meeting Service Level Objectives.
Own reliability, uptime, system security, cost, operations, capacity, resiliency, and performance analysis.
Lead initiatives to improve application and platform reliability using data-driven analytics.
Ensure architecture and deployment models meet SLA commitments.
Maintain and enhance telemetry systems to improve visibility into application performance and business metrics.
Develop and monitor standard processes to promote the long-term health and sustainability of operational tasks.