About The Position

Do you like collaborating across teams to solve complex problems? Do you enjoy solving large-scale distributed systems problems? Join the Mapping SRE team!

The Mapping SRE team is responsible for overseeing and improving the availability, reliability, performance and change management procedures of Akamai's mapping system. Our system routes trillions of client requests per day, controlling tens of terabits per second of content traffic served to clients worldwide. Our team defines KPIs, advances the state of measurements, monitoring dashboards and alerts, and investigates complex production issues.

Partner with the best

In this role, you'll work closely with cross-functional teams to understand and improve the performance, availability and reliability of Akamai's Mapping Service. You'll define key performance indicators (KPIs), advance the state of monitoring, alerting and operational responses, and investigate complex performance issues.

As a Senior Site Reliability Engineer, you will carry out the duties listed under Responsibilities below.

Requirements

  • Have 5 years of relevant experience and a Master's degree in Computer Science or its equivalent
  • Demonstrate experience in one of the scripting or procedural languages (Python, Perl, shell, C/C++, Java, etc.)
  • Possess experience working in a UNIX/Linux computing environment
  • Have experience with monitoring, alerting, and logging platforms such as Grafana
  • Have an in-depth understanding of computer networking concepts, Unix/Linux internals, distributed systems, and system design
  • Have excellent communication and organizational skills, and be able to articulate technical information in an easy-to-understand manner

Responsibilities

  • Monitoring, investigating, and analyzing performance and availability by (co)designing, managing, and tracking product-related SLIs/SLOs
  • Solving problems and avoiding recurrence by developing tools / prototypes to proactively monitor service performance and availability
  • Working closely with product engineers to advocate for reliable and scalable system design for supportability, resilience and reliability
  • Leveraging skills in data analysis, network diagnostics and debugging tools to characterize performance and recommend improvements
  • Engaging with our support, operations and engineering teams to investigate and troubleshoot complex problems, including incident management and post-mortem reviews
  • Collaborating with internal teams to help troubleshoot and resolve escalations and incidents for our customers

Benefits

At Akamai, we will provide you with opportunities to grow, flourish, and achieve great things. Our benefit plan options are designed to meet your individual needs and budget, both today and in the future. We provide benefits surrounding all aspects of your life:

  • Your health
  • Your finances
  • Your family
  • Your time at work
  • Your time pursuing other endeavors