Senior Site Reliability Engineer

Blitzy • Cambridge, MA

4d • Onsite

About The Position

As a Senior Site Reliability Engineer at Blitzy's Kendall Square headquarters, you will be a foundational force behind the reliability, scalability, and operational excellence of our AI-powered software development platform. Sitting at the intersection of software engineering and infrastructure, you'll ensure that the systems enabling enterprise customers to autonomously build production-ready software remain performant, resilient, and always available. This is a high-ownership, high-impact role for an engineer who operates with urgency, thinks in systems, and takes pride in building infrastructure that doesn't break.

Requirements

5–8 years of experience in Site Reliability Engineering, DevOps, or Platform Engineering.
Deep expertise in Kubernetes — cluster management, workload deployment, scaling strategies, and troubleshooting in production.
Strong proficiency with at least one major cloud platform (AWS preferred); experience designing and operating distributed, high-availability systems.
Hands-on Terraform experience for infrastructure-as-code provisioning and management.
Proven ability to define and operationalize SLOs, SLAs, and incident response processes.
Strong scripting and automation skills in Python, Go, or Bash.
Experience designing and maintaining comprehensive observability systems across complex, multi-service environments.
Excellent cross-functional communication skills — able to partner with software engineers, product teams, and leadership equally well.

Nice To Haves

Experience operating infrastructure for AI or ML workloads, including GPU scheduling or model serving infrastructure.
Familiarity with MLOps tooling (MLflow, Kubeflow, or similar) and the operational challenges unique to AI-driven services.
Knowledge of service mesh technologies (Istio, Linkerd) and advanced networking patterns.
CKA (Certified Kubernetes Administrator) certification or equivalent demonstrated expertise.
Prior experience at a high-growth startup where you built reliability foundations from the ground up.
A track record of influencing engineering culture — not just fixing infrastructure, but raising the bar for how teams think about reliability.

Responsibilities

Design, build, and operate highly available, fault-tolerant infrastructure across cloud environments supporting Blitzy's AI platform and enterprise customers.
Define and own SLOs, SLAs, and error budgets for critical services; lead blameless postmortems and drive systemic improvements that prevent recurrence.
Build and maintain robust CI/CD pipelines, release automation, and deployment infrastructure that empower engineers to ship with speed and safety.
Own the full observability stack — logging, metrics, distributed tracing, and alerting (e.g., Prometheus, Grafana, Datadog, OpenTelemetry).
Manage Kubernetes clusters and container infrastructure supporting AI agent workloads and production application services.
Drive infrastructure-as-code practices using Terraform; ensure all provisioning is automated, auditable, and version-controlled.
Partner with engineering teams at HQ to embed reliability and operational best practices early in the development lifecycle.
Lead capacity planning, performance benchmarking, and cloud cost optimization as the platform scales.