Senior Site Reliability Engineer

Akamai
Cambridge, MA
Remote

About The Position

The Akamai Inference Cloud team is part of Akamai's Cloud Technology Group. We design, implement, deploy, and operate AI platforms that enable customers to run inference models and developers to create AI applications with unmatched performance, compliance, and economics. In this role, you'll own reliability workstreams for Akamai's serverless inference platform, build automation and tooling, and contribute to architecture and operational decisions. You'll also have opportunities to take ownership of critical reliability problems end-to-end, partner with product engineering teams, and develop expertise in GPU infrastructure, Kubernetes at scale, and AI inference workloads.

Requirements

  • 5+ years of experience in SRE, infrastructure engineering, or platform engineering, working with large-scale distributed systems
  • Extensive experience with Kubernetes and containerization at scale
  • Experience defining SLOs and working with observability tools such as Prometheus, Grafana, and distributed tracing
  • Coding ability in Python or Go for automation and tooling, with experience in CI/CD pipelines, deployment safety, and infrastructure-as-code
  • Ability to take ownership of problems and drive them to resolution independently

Nice To Haves

  • Interest in or experience with AI/ML infrastructure, model serving, or GPU workloads

Responsibilities

  • Building and maintaining observability for AI workloads, including telemetry, dashboards, alerts, and SLO/SLI tracking, and driving improvements when targets are missed
  • Writing automation and tooling to reduce operational toil, improve deployment safety, and accelerate incident response
  • Integrating AI workloads into Akamai's existing incident management processes, building runbooks, participating in on-call rotations, and conducting blameless post-mortems
  • Building and maintaining CI/CD integrations, deployment safety checks, and rollback automation
  • Collaborating with product engineering teams to improve reliability, contribute to architecture decisions, and ensure operational readiness for product releases
  • Contributing to capacity planning, autoscaling configuration, and workload scheduling for AI compute infrastructure

Benefits

  • Akamai provides industry-leading benefits, including healthcare, a 401(k) savings plan, company holidays, vacation (in the form of PTO), sick time, family-friendly benefits such as parental leave, and an employee assistance program with a focus on mental and financial wellness