Senior Site Reliability Engineer

Akamai•United States,

3d•$121,400 - $218,600•Hybrid

About The Position

The AI Hardware SRE team is responsible for overseeing, scaling, and optimizing our next-generation dedicated AI hardware infrastructure. You will be responsible for ensuring best-in-class uptime and reliability of our AI hardware infrastructure offerings. In this role, you'll play a part in pioneering the reliability an elite, high-density hardware and software infrastructure spanning the globe. You'll collaborate with product teams from the earliest stages of development to ensure the reliability, scalability, and performance of our systems. You'll define key performance indicators and defend them when they are breached.

Requirements

5 years of relevant experience and a Bachelor's degree in Computer Engineering, Computer Science or equivalent
Possess tooling and coding ability in languages like Python to construct scalable operational tools, API integrations, and automation frameworks.
Show hands-on experience with modern observability stacks and timeseries engines, like Prometheus, Grafana, OpenTelemetry, and Loki.
Possess a working understanding of advanced networking topologies, high-bandwidth routing/switching infrastructure, BGP, and dual-stack IPv4/IPv6 networks.
Have experience acting as a key designer for new service rollouts, including establishing operational readiness criteria, telemetry baselines, and alerting thresholds.
Demonstrate extensive experience building technical runbooks, leading complex incident response bridges, and driving comprehensive, blameless post-mortems.
Display a proven ability to take absolute ownership of ambiguous technical problems, coordinate cross-functional teams, and drive for production-grade solutions.

Responsibilities

Developing and scaling robust programmatic tooling and infrastructure-as-code utilities in Python to eliminate operational toil and automate fleet-wide provisioning.
Integrating automated workflows across disconnected corporate ticketing systems to optimize time-to-mitigate metrics for hardware and network break-fix events.
Leveraging advanced AI utilities and LLM-assisted development paradigms where appropriate to accelerate technical execution, script authorship, and system analysis
Working on cutting-edge private cloud and compute technologies to improve the availability, latency, and overall systemic health of high-density hardware environments.
Designing and implementing telemetry pipelines, custom Prometheus/Grafana monitoring dashboards, and AI-based anomaly detection tailored for bare-metal and virtualized environments.
Participating in 24x7x365 on-call rotations, spearheading real-time incident management, and managing high-severity service disruption protocols via automated PagerDuty and Slack workflows.
Partnering directly with third-party infrastructure vendors and coordinating on-site field technicians to facilitate uptime activities.