Senior Site Reliability Engineer

McGraw Hill LLC.

76d•$140,000 - $155,000•Remote

About The Position

At McGraw Hill, we create best-in-class, next-generation learning platforms that are used by millions of students and educators worldwide, from kindergarten through graduate school. Our goal is to accelerate student success through intuitive and effective learning tools and content that maximize a teacher’s time and a student’s learning experience. We do all of this in a supportive, collaborative environment where you can grow your career in a way that fits into your life.

How can you make an impact? We are hiring a Senior Site Reliability Engineer who will build and support reliable, high-capacity, and well-performing systems in support of our mission to protect and improve the McGraw Hill customer platforms, with an ever-watchful eye on reliability, security, performance, cost, and operational excellence. As a Senior Site Reliability Engineer, you will collaborate in a DevOps model with product development teams, designing, deploying, and managing automation tools that improve predictability and time to market while reducing cost.

This is a remote position open to applicants authorized to work for any employer within Canada.

Requirements

Experience developing, debugging, and deploying enterprise applications
Infrastructure automation and container orchestration experience (e.g., Terraform, EKS, ECS)
Strong troubleshooting across web, application, networking, OS, and database technologies
Experience with CI/CD, high-concurrency systems, and cloud-based production infrastructure
Proven problem-solving, communication, and root cause analysis skills
BS in Computer Science or related field (or equivalent experience)

Nice To Haves

Kubernetes/EKS experience preferred

Responsibilities

Partner with product teams in a DevOps model to design, deploy, and automate cloud infrastructure (AWS, Terraform)
Optimize system reliability, performance, scalability, and cost efficiency
Implement and maintain infrastructure-as-code and monitoring-as-code for transparency and repeatability
Own application reliability, uptime, security, capacity, and SLA performance
Lead major incident response, on-call support, and triage bridges
Enhance observability and telemetry to monitor customer experience, KPIs, and infrastructure health
Support secure, agile development practices in partnership with CyberSecurity (DevSecOps)
Drive resiliency efforts, including failure testing, capacity forecasting, and scaling plans
Mentor engineers and collaborate cross-functionally across stakeholder groups
Promote knowledge sharing, automation-first practices, and continuous improvement