Lead Site Reliability Engineer

McGraw Hill LLC.

32d • $140,000 - $155,000 • Remote


About The Position

McGraw Hill, a leading provider of digital educational resources and content, is seeking a Lead Site Reliability Engineer to manage a team of six engineers in our Digital Platform Group, supporting our K–12 learning platforms. These platforms serve millions of students and educators nationwide, and you’ll play a key role in ensuring their reliability, scalability, and performance. Working closely with engineering and product teams, you’ll leverage your expertise in AWS, Terraform, and observability tools to drive automation, enhance resiliency, and maintain the health of our cloud-based infrastructure. This is a remote position open to applicants authorized to work for any employer within the United States.

Requirements

5+ years of experience in SRE, DevOps, or Software Engineering roles supporting enterprise applications.
Strong problem-solving, triage, and root cause analysis skills with a systems engineering mindset.
Deep expertise in the AWS ecosystem, with hands-on experience across core services, primarily ECS, RDS, EKS, IAM, CloudWatch, and networking configurations.
Expertise with Terraform for managing and automating scalable cloud infrastructure.
Skilled in CI/CD pipelines (e.g., GitHub Actions) and managing end-to-end software delivery lifecycles.
Strong familiarity with telemetry and observability tools (e.g., New Relic, Datadog), including querying logs and metrics for performance monitoring.

Responsibilities

Lead a six-member SRE team supporting production infrastructure and services
Manage backlog, sprint planning, and team velocity
Own reliability, uptime, security, cost, and performance of services
Define and monitor SLOs for application workloads
Plan on-call rotations and work to reduce alert fatigue
Forecast seasonal growth and plan capacity accordingly
Mentor engineers and foster professional growth
Report status and issues to leadership monthly
Partner with development teams
Collaborate with CyberSecurity on risk mitigation
Collaborate with FinOps on cost reduction
Design and troubleshoot highly distributed, cloud-based production systems
Maintain infrastructure-as-code and monitoring-as-code practices
Improve system resiliency through failure injection and chaos testing
Participate in on-call rotation and resolve operational issues
Optimize existing systems for performance and cost
Ensure telemetry provides visibility into application performance
Support agile development practices and code reviews


What This Job Offers

Job Type: Full-time
Career Level: Mid Level
Education Level: No Education Listed

