Senior Site Reliability Engineer

Akamai • Cambridge, MA

13d • $121,400 - $218,600 • Hybrid

About The Position

Our team designs, develops, and manages the applications and infrastructure that support Akamai Cloud's products and services. Our SRE teams solve reliability, security, and usability challenges at scale for our global fleet while keeping Akamai's mission at the forefront of what we do: make life better for billions of people, billions of times a day. In this role, you will focus on configuration management, infrastructure as code (IaC), and CI/CD. You will design, develop, and operate infrastructure deployment for the Akamai Cloud.

Requirements

Have 5 years of relevant experience and a Bachelor's degree in Computer Engineering, Computer Science, or equivalent.
Possess advanced experience designing, implementing, and supporting enterprise CI/CD pipelines using tools such as Jenkins, GitHub Actions, or GitLab CI/CD.
Have extensive development skills in languages such as Python, Go, or Bash.
Have experience integrating security, compliance, and quality controls into CI/CD pipelines, including automated testing, artifact management, and deployment strategies.
Have experience using tools such as SaltStack, Terraform, Ansible, Chef, or Puppet to manage infrastructure as code effectively and efficiently.
Demonstrate advanced experience in a site reliability or software engineering role, working with large-scale distributed systems.

Responsibilities

Designing, developing, testing, and operating critical services that support the reliability, scalability, and performance of our infrastructure.
Designing and implementing observability solutions, including monitoring, logging, alerting, and telemetry capabilities, to proactively detect and resolve issues.
Driving reliability improvements through automation, reducing operational toil and increasing the resilience of engineering processes.
Developing deep technical expertise in infrastructure-as-code (IaC) systems and serving as a trusted technical resource, mentoring engineers and sharing best practices.
Collaborating with software engineering, infrastructure, and platform teams to investigate complex production issues, identify root causes, and implement long-term corrective actions.
Participating in an on-call rotation and providing leadership during incident response, driving timely service restoration, effective communication, and post-incident improvement efforts.