DevOps Engineer

Blitzy • Cambridge, MA

176d • Onsite

About The Position

As a DevOps Engineer at Blitzy, you will be a critical force behind the infrastructure powering our cutting-edge AI agents and enterprise software development platform. Based out of our Cambridge, MA headquarters, you'll architect and maintain the scalable, resilient systems that enable Blitzy to autonomously deliver production-ready software at unprecedented speed. This is a high-impact, hands-on role where your work directly shapes the reliability and performance of a platform used by Fortune 500 companies.

Requirements

5–8 years of DevOps or infrastructure engineering experience in production environments.
Deep expertise in Kubernetes, including deployment, scaling, networking, and troubleshooting.
Strong Python proficiency for automation, scripting, and tooling.
Hands-on experience with Helm for application package management.
Proven track record designing and maintaining CI/CD pipelines.
Experience with major cloud platforms (AWS, Azure, or GCP).
Proficiency with Terraform for Infrastructure as Code.
Strong Linux administration skills and containerization expertise (Docker).

Nice To Haves

CKA (Certified Kubernetes Administrator) certification.
Experience with MLOps tooling such as MLflow, Kubeflow, or similar platforms.
Background in microservices architecture and service mesh technologies.
Familiarity with API gateway management and advanced service mesh configurations.
A bias for automation: if you've done something manually twice, you've already started scripting it.
Passion for AI infrastructure and excitement about building systems at the frontier of what's technically possible.

Responsibilities

Build, manage, and scale Kubernetes clusters supporting AI agent workloads and production application deployments.
Design and implement robust CI/CD pipelines for both application services and AI-driven workflows.
Automate infrastructure provisioning, scaling, and operations using Python and Terraform.
Deploy and maintain applications via Helm charts, ensuring consistency across environments.
Own the observability stack: alerting, distributed tracing, and monitoring for all production services and APIs.
Build and maintain infrastructure for AI agent orchestration, enabling reliable and high-throughput agent execution.
Partner closely with engineering teams to improve developer experience, deployment strategies, and operational tooling.
Maintain and continuously improve the security, reliability, and cost-efficiency of our cloud environments.