Senior Site Reliability Engineer

Blitzy
Cambridge, MA
Onsite

About The Position

As a Senior Site Reliability Engineer at Blitzy's Kendall Square headquarters, you will be a foundational force behind the reliability, scalability, and operational excellence of our AI-powered software development platform. Sitting at the intersection of software engineering and infrastructure, you'll ensure that the systems enabling enterprise customers to autonomously build production-ready software remain performant, resilient, and always available. This is a high-ownership, high-impact role for an engineer who operates with urgency, thinks in systems, and takes pride in building infrastructure that doesn't break.

Requirements

  • 5–8 years of experience in Site Reliability Engineering, DevOps, or Platform Engineering.
  • Deep expertise in Kubernetes — cluster management, workload deployment, scaling strategies, and troubleshooting in production.
  • Strong proficiency with at least one major cloud platform (AWS preferred); experience designing and operating distributed, high-availability systems.
  • Hands-on Terraform experience for infrastructure-as-code provisioning and management.
  • Proven ability to define and operationalize SLOs, SLAs, and incident response processes.
  • Strong scripting and automation skills in Python, Go, or Bash.
  • Experience designing and maintaining comprehensive observability systems across complex, multi-service environments.
  • Excellent cross-functional communication skills — able to partner with software engineers, product teams, and leadership equally well.

Nice To Haves

  • Experience operating infrastructure for AI or ML workloads, including GPU scheduling or model serving infrastructure.
  • Familiarity with MLOps tooling (MLflow, Kubeflow, or similar) and the operational challenges unique to AI-driven services.
  • Knowledge of service mesh technologies (Istio, Linkerd) and advanced networking patterns.
  • CKA (Certified Kubernetes Administrator) certification or equivalent demonstrated expertise.
  • Prior experience at a high-growth startup where you built reliability foundations from the ground up.
  • A track record of influencing engineering culture — not just fixing infrastructure, but raising the bar for how teams think about reliability.

Responsibilities

  • Design, build, and operate highly available, fault-tolerant infrastructure across cloud environments supporting Blitzy's AI platform and enterprise customers.
  • Define and own SLOs, SLAs, and error budgets for critical services; lead blameless postmortems and drive systemic improvements that prevent recurrence.
  • Build and maintain robust CI/CD pipelines, release automation, and deployment infrastructure that empower engineers to ship with speed and safety.
  • Own the full observability stack — logging, metrics, distributed tracing, and alerting (e.g., Prometheus, Grafana, Datadog, OpenTelemetry).
  • Manage Kubernetes clusters and container infrastructure supporting AI agent workloads and production application services.
  • Drive infrastructure-as-code practices using Terraform; ensure all provisioning is automated, auditable, and version-controlled.
  • Partner with engineering teams at HQ to embed reliability and operational best practices early in the development lifecycle.
  • Lead capacity planning, performance benchmarking, and cloud cost optimization as the platform scales.

Benefits

  • Meaningful equity