Site Reliability Engineer

Akamai • Cambridge, MA

$75,700 - $136,300 • Hybrid

About The Position

Our team designs, develops, and manages applications and infrastructure that support Akamai Cloud's products and services. Our SRE teams solve reliability, security, and usability challenges at scale for our global fleet while keeping Akamai's mission at the forefront of what we do: make life better for billions of people, billions of times a day. In this role, you will focus on configuration management, Infrastructure as Code (IaC), and CI/CD. You will design, develop, and operate infrastructure deployment for Akamai Cloud.

Requirements

Relevant experience and a Bachelor's degree in Computer Engineering, Computer Science or equivalent
Demonstrate experience in a Site Reliability or Software Engineering role, working with large-scale distributed systems
Have experience with Terraform, including module development, state management, workspace design, policy enforcement, and enterprise-scale Infrastructure as Code implementations
Have experience managing Infrastructure as Code solutions using tools such as Terraform, SaltStack, Ansible, Chef, Puppet, or similar technologies
Have experience designing, developing, and deploying software and infrastructure at scale in a Linux environment
Have great communication and interpersonal skills

Responsibilities

Designing, developing, testing, and operating critical services that support the reliability, scalability, and performance of our infrastructure.
Designing and implementing observability solutions, including monitoring, logging, alerting, and telemetry capabilities, to proactively detect and resolve issues.
Driving reliability improvements through automation, reducing operational toil and increasing the resilience of engineering processes.
Developing technical expertise in IaC systems and serving as a trusted technical resource, mentoring engineers and sharing best practices.
Collaborating with software engineering, infrastructure, and platform teams to investigate complex production issues, identify root causes, and implement long-term corrective actions.
Participating in an on-call rotation and providing leadership during incident response, driving timely service restoration, effective communication, and post-incident improvement efforts.