L2 Support Engineer

Blitzy•Cambridge, MA

4d•$100,000 - $140,000•Onsite

About The Position

This role supports our clients and keeps their environments stable across the full lifecycle: installation, ongoing upgrades, and day-to-day operation. The L2 Support Engineer works alongside L1 to triage and resolve issues, and escalates unresolved defects to engineering. The work spans Kubernetes, Docker, and the major cloud providers.

Requirements

Distributed-systems debugging. Reason about a request crossing multiple services, queues, and network hops, and isolate which hop failed. You debug by forming a hypothesis and confirming it with evidence (logs, pod state, queue depth, DB rows), not by guessing.
Kubernetes & Docker.
Major cloud providers: GCP, AWS, and Azure. Deep hands-on experience with at least one, and able to work across the others: managed Kubernetes (GKE/EKS/AKS), cloud logging, IAM/auth basics, and cloud disk/storage behavior.
Strong monitoring & observability practice. Fluent with an APM/observability stack (Datadog or equivalent): log queries, correlating across services by request/trace IDs, reading traces, and building dashboards and alerts. You reach for the data before theorizing.
Methodical, evidence-first temperament. Hold several candidate causes at once, run the cheapest disconfirming check first, and never claim a root cause or fix you haven't proven.
Multi-tenant safety mindset. Environments are shared and customer-owned: default to read-only diagnostics and understand blast radius before changing anything.

Nice To Haves

Python and Redis literacy.
Basic message queueing. Command transport runs over a message queue (Redis/rq). Comfort inspecting queue depth, backlogs, and stuck/failed jobs; concepts transfer from any broker.
Networking & WebSockets. Many of our hardest issues are connection problems: WebSocket/Socket.IO drops, NAT/idle/LB timeouts, half-open sockets, DNS-vs-routing, TLS. Tell a transport fault from an application fault.
SQL / PostgreSQL. Query operational tables to confirm what the system recorded.
Source-control platforms. GitHub (incl. GitHub Enterprise Server), Azure DevOps, and/or GitLab: clone/push/pull, access tokens, app credentials, and their failure modes.
CI/CD, Helm & deploy integrity. Many "sudden regressions" are a bad or partial deploy: check what version is actually running before chasing architecture theories. Helm and container deploy pipelines expected. ArgoCD is a plus.
Secrets management. Comfort handling secrets, credentials, and certificates safely, ideally with Vault (strongly preferred).
Linux and Windows. Workloads run on both; comfort triaging on each OS (process inspection, filesystem, basic networking).
Incident management & ticketing workflows: Jira or similar (a plus).
Prior customer-facing support or SRE/on-call experience (a plus).

Responsibilities

Deploy and install the platform into customer environments, and troubleshoot installation issues.
Support ongoing upgrades and day-to-day operation, keeping customer environments stable.
Work alongside L1 to triage and resolve customer-reported issues, driving them to resolution or escalation.
Diagnose failures across the stack: compute, networking, storage, and the services running on it.
Reproduce issues safely against live (often multi-tenant) environments using read-only diagnostics first.
Build and maintain dashboards, monitors, and runbooks so recurring issues get faster to fix, or stop recurring.
Write up clear, evidence-backed escalations and post-incident notes.
Communicate status and resolution to customers clearly and on time.