Senior DevOps Engineer (AI Ops)

Adobe, San Jose, CA

About The Position

We are seeking a hands-on Senior DevOps Engineer specializing in AI Ops to own infrastructure provisioning, CI/CD automation, telemetry pipelines, and production deployment for AI-powered services, agents, and orchestration systems. This role builds and operates the infrastructure that enables reliable, observable, and scalable AI systems in production. The engineer will help operationalize AI platforms by implementing intelligent monitoring, automated incident response, model lifecycle governance, and data-driven operational insights. The role is SRE-heavy and infrastructure-first, with responsibility for ensuring that production systems and services built on advanced technology are reliable, resilient, scalable, secure, and cost-effective.

Requirements

  • Infrastructure as Code (Terraform, etc.)
  • Kubernetes clusters
  • CI/CD pipelines
  • Containerized environments
  • Model and agent performance monitoring
  • Scalable pipelines for collecting and processing logs, metrics, traces, and operational events
  • Structured telemetry for AI services and orchestration systems

Responsibilities

  • Design and manage cloud infrastructure using Infrastructure as Code (Terraform, etc.)
  • Provision and maintain Kubernetes clusters and supporting services
  • Automate environment setup across dev, stage, and production
  • Build and maintain CI/CD pipelines for AI services, agent frameworks, orchestrators, and model artifacts
  • Implement automated testing and reliability validation gates
  • Build safe rollback mechanisms for services and models
  • Integrate reliability and health checks into deployment workflows
  • Package, version, and deploy models and agent services in containerized environments while managing artifact promotion across environments
  • Monitor model and agent performance (latency, throughput, accuracy, cost) and enable safe rollout, rollback, and refresh workflows
  • Design and operate scalable pipelines for collecting and processing logs, metrics, traces, and operational events
  • Enable structured telemetry for AI services and orchestration systems to support real-time monitoring and operational insights
  • Integrate the AIOps platform
  • Ensure production reliability and SRE excellence

Benefits

  • Comprehensive benefits programs