DevOps Engineer III

Advisor360 • Needham, MA

About The Position

At Advisor360°, our Agentic AI team is building the platform layer that makes AI systems truly production-ready—and we’re already live in production. This isn’t a greenfield initiative; it’s a high-impact environment where real systems are running at scale today. As a DevOps Engineer, you’ll own the infrastructure that powers these systems. Working hands-on with Kubernetes, GitOps, and ArgoCD, you’ll design and operate the deployment framework that enables multiple teams to ship independently and efficiently. You’ll play a critical role in establishing operational standards, ensuring reliability, and building the foundation that allows AI-driven workflows to execute with confidence at scale.

Requirements

5+ years operating Kubernetes in production.
Hands-on GitOps experience with ArgoCD: multi-environment setups, ApplicationSets, sync waves, health checks, and rollback under pressure.
Azure fluency: AKS, ACR, Azure Monitor, Key Vault, managed identity, workload identity, networking.
Infrastructure-as-code as a default: Terraform for everything, no console cowboys.
Scripting in Python, Go, or Bash for automation and tooling: maintained code, not one-off scripts.
Strong incident response instincts. You've been on-call, written postmortems, and fixed the underlying conditions rather than just the symptom.
Experience running LLM inference infrastructure or API gateway patterns for AI workloads.
Familiarity with agentic AI frameworks (LangGraph, AutoGen, or similar) and the infrastructure patterns they require.
OPA/Gatekeeper or other policy-as-code tooling for cluster governance at scale.
OpenTelemetry and distributed tracing across non-trivial service meshes.
Service mesh experience (Istio or Linkerd) for service-to-service auth and traffic management.
CKA or CKS certification.
Prior work on multi-tenant platforms where teams are both customers and contributors.

Responsibilities

Cluster operations on AKS: node pool sizing, autoscaling policies, namespace isolation, network policies, and day-two operational hygiene across environments.
GitOps delivery pipeline using ArgoCD: app-of-apps structure, environment promotion, rollback strategy, and the guardrails that prevent one team's bad deploy from cascading.
Deployment strategies: blue-green, canary, and rolling release patterns for agentic services where a bad rollout has downstream effects on active workflows.
Security posture: RBAC, Azure AD Workload Identity, network policies, secrets management via Key Vault, and policy-as-code enforcement with OPA/Gatekeeper.
Platform reliability: SLIs, SLOs, alerting, and runbooks for the infra layer. When something breaks at 2am, you write the playbook.
Developer experience: reduce the toil that slows down six feature teams. The right self-service primitives mean engineers spend time building skills, not waiting on infra tickets.
Cost and capacity management: LLM workloads have spiky, non-linear cost profiles. You'll instrument and enforce budgets, quotas, and rightsizing across the cluster.