DevOps Engineer III

Advisor360, Needham, MA

About The Position

At Advisor360°, our Agentic AI team is building the platform layer that makes AI systems truly production-ready—and we’re already live in production. This isn’t a greenfield initiative; it’s a high-impact environment where real systems are running at scale today. As a DevOps Engineer, you’ll own the infrastructure that powers these systems. Working hands-on with Kubernetes, GitOps, and ArgoCD, you’ll design and operate the deployment framework that enables multiple teams to ship independently and efficiently. You’ll play a critical role in establishing operational standards, ensuring reliability, and building the foundation that allows AI-driven workflows to execute with confidence at scale.

Requirements

  • 5+ years operating Kubernetes in production.
  • Hands-on GitOps experience with ArgoCD: multi-environment setups, ApplicationSets, sync waves, health checks, and rollback under pressure.
  • Azure fluency: AKS, ACR, Azure Monitor, Key Vault, managed identity, workload identity, networking.
  • Infrastructure-as-code as a default: Terraform for everything, no console cowboys.
  • Scripting in Python, Go, or Bash for automation and tooling: maintained code, not one-offs.
  • Strong incident response instincts. You've been on-call, written postmortems, and fixed the underlying conditions rather than just the symptom.
  • Experience running LLM inference infrastructure or API gateway patterns for AI workloads.
  • Familiarity with agentic AI frameworks (LangGraph, AutoGen, or similar) and the infrastructure patterns they require.
  • OPA/Gatekeeper or other policy-as-code tooling for cluster governance at scale.
  • OpenTelemetry and distributed tracing across non-trivial service meshes.
  • Service mesh experience (Istio or Linkerd) for service-to-service auth and traffic management.
  • CKA or CKS certification.
  • Prior work on multi-tenant platforms where teams are both customers and contributors.

Responsibilities

  • Cluster operations on AKS: node pool sizing, autoscaling policies, namespace isolation, network policies, and day-two operational hygiene across environments.
  • GitOps delivery pipeline using ArgoCD: app-of-apps structure, environment promotion, rollback strategy, and the guardrails that prevent one team's bad deploy from cascading.
  • Deployment strategies: blue-green, canary, and rolling release patterns for agentic services where a bad rollout has downstream effects on active workflows.
  • Security posture: RBAC, Azure AD Workload Identity, network policies, secrets management via Key Vault, and policy-as-code enforcement with OPA/Gatekeeper.
  • Platform reliability: SLIs, SLOs, alerting, and runbooks for the infra layer. When something breaks at 2am, you write the playbook.
  • Developer experience: reduce the toil that slows down six feature teams. The right self-service primitives mean engineers spend time building skills, not waiting on infra tickets.
  • Cost and capacity management: LLM workloads have spiky, non-linear cost profiles. You'll instrument and enforce budgets, quotas, and rightsizing across the cluster.

Benefits

  • Competitive base salaries
  • Annual performance-based bonuses
  • Equity
  • Comprehensive health benefits, including dental, life, and disability insurance
  • Unlimited paid time off