Senior Site Reliability Engineer

Akamai

9d•Remote

About The Position

Do you like collaborating across teams to solve complex problems? Do you enjoy solving large-scale distributed systems problems? Join the Mapping SRE team!

The Mapping SRE team is responsible for overseeing and improving the availability, reliability, performance and change management procedures of Akamai's mapping system. Our system routes trillions of client requests per day, controlling tens of terabits per second of content traffic served to clients worldwide. Our team defines KPIs, advances the state of measurements, monitoring dashboards and alerts, and investigates complex production issues.

Partner with the best

In this role, you'll work closely with cross-functional teams to understand and improve the performance, availability and reliability of Akamai's Mapping Service. You'll define key performance indicators (KPIs), advance the state of monitoring, alerting and operational responses, and investigate complex performance issues.

Requirements

Have 5 years of relevant experience and a Master's degree in Computer Science or its equivalent
Demonstrate experience in one of the scripting or procedural languages (Python, Perl, shell, C/C++, Java, etc.)
Possess experience working in a UNIX/Linux computing environment
Have experience with monitoring, alerting, and logging platforms such as Grafana
Have in-depth understanding of computer networking concepts, Unix/Linux internals, distributed systems, and system design
Have excellent communication and organizational skills, and be able to articulate technical information in an easy-to-understand manner

Responsibilities

Monitoring, investigating, and analyzing performance and availability by (co)designing, managing, and tracking product-related SLIs/SLOs
Solving problems and avoiding recurrence by developing tools / prototypes to proactively monitor service performance and availability
Working closely with product engineers to advocate reliable and scalable system design for supportability, resilience and reliability
Leveraging skills in data analysis, network diagnostics and debugging tools to characterize performance and recommend improvements
Engaging with our support, operations and engineering teams to investigate and troubleshoot complex problems, including incident management and post-mortem reviews
Collaborating with internal teams to help troubleshoot and resolve escalations and incidents for our customers

Benefits

At Akamai, we will provide you with opportunities to grow, flourish, and achieve great things. We provide benefits surrounding all aspects of your life:
Your health
Your finances
Your family
Your time at work
Your time pursuing other endeavors
Our benefit plan options are designed to meet your individual needs and budget, both today and in the future.
