Senior Site Reliability Engineer

McGraw Hill LLC.
$140,000 - $155,000
Remote

About The Position

At McGraw Hill, we create best-in-class, next-generation learning platforms used by millions of students and educators worldwide, from kindergarten through graduate school. Our goal is to accelerate student success through intuitive and effective learning tools and content that maximize a teacher’s time and a student’s learning experience. We do all of this in a supportive, collaborative environment where you can grow your career in a way that fits into your life.

How can you make an impact? We are hiring a Senior Site Reliability Engineer to build and support reliable, high-capacity, and well-performing systems in support of our mission to protect and improve the McGraw Hill customer platforms, with an ever-watchful eye on reliability, security, performance, cost, and operational excellence. As a Senior Site Reliability Engineer, you will collaborate in a DevOps model with product development teams, designing, deploying, and managing automation tools that increase predictability and shorten time to market while reducing cost.

This is a remote position open to applicants authorized to work for any employer within Canada.

Requirements

  • Experience developing, debugging, and deploying enterprise applications
  • Infrastructure automation and container orchestration experience (e.g., Terraform, EKS, ECS)
  • Strong troubleshooting across web, application, networking, OS, and database technologies
  • Experience with CI/CD, high-concurrency systems, and cloud-based production infrastructure
  • Proven problem-solving, communication, and root cause analysis skills
  • BS in Computer Science or related field (or equivalent experience)

Nice To Haves

  • Kubernetes/EKS experience preferred

Responsibilities

  • Partner with product teams in a DevOps model to design, deploy, and automate cloud infrastructure (AWS, Terraform)
  • Optimize system reliability, performance, scalability, and cost efficiency
  • Implement and maintain infrastructure-as-code and monitoring-as-code for transparency and repeatability
  • Own application reliability, uptime, security, capacity, and SLA performance
  • Lead major incident response, on-call support, and triage bridges
  • Enhance observability and telemetry to monitor customer experience, KPIs, and infrastructure health
  • Support secure, agile development practices in partnership with CyberSecurity (DevSecOps)
  • Drive resiliency efforts, including failure testing, capacity forecasting, and scaling plans
  • Mentor engineers and collaborate cross-functionally across stakeholder groups
  • Promote knowledge sharing, automation-first practices, and continuous improvement