AI Infrastructure Engineer

Percepta
New York, NY

About The Position

We're hiring an AI Infrastructure Engineer to own the infrastructure, deployment, and operational reliability that powers Percepta's AI systems, including the autonomous agents at the core of what we ship.

Part of the work is hardening what exists: tightening our Terraform footprint, strengthening deployment pipelines, bringing more rigor to how we manage infrastructure across regions and providers. Part of it is building what's missing. And part of it is genuinely new territory: figuring out what SRE means when the systems you're operating make autonomous decisions. The infrastructure patterns for the agentic systems of the future don't exist yet. You'll help define them.

Why this is different

  • You're deploying autonomous systems. The infrastructure contract changes when your workloads have agency. Observability means understanding why an agent made a decision, not just whether a pod is healthy.
  • The gap between research and production is real here. Our teams move optimization algorithms and AI systems from research environments into production, and you'll be part of that handoff. MLOps experience isn't required, but you'll be closer to that boundary than most infra roles.
  • Small team. Real ownership. You're making foundational decisions, not inheriting someone else's.

Requirements

  • 5+ years building and operating production infrastructure in DevOps or SRE roles
  • The kind of engineer who sees a manual process and can't rest until it's automated well, not just scripted
  • Strong hands-on Terraform experience
  • Deep experience with at least 1 major cloud provider (AWS, GCP, or Azure): networking, IAM, cost management, the operational realities of production workloads
  • Solid Docker and Kubernetes experience in production. We run managed clusters across all 3 major clouds; this is a core part of the role
  • Experience designing and maintaining CI/CD pipelines (GitHub Actions, GitLab CI, or similar)
  • Scripting proficiency in Python, Bash, or similar
  • High agency: you don't wait for a ticket to fix what's broken, but you communicate, collaborate, and bring the team along
  • Genuine curiosity about AI systems, not just the infrastructure running them. You want to understand what you're operating
  • You find it interesting (not alarming) that some systems you'll operate will be making decisions on their own

Nice To Haves

  • Multi-region and multi-cloud experience across 2+ providers
  • Experience with single-tenant or on-prem deployments alongside multi-tenant SaaS
  • Familiarity with GitOps patterns and progressive delivery
  • Familiarity with the Grafana stack (Prometheus, Grafana, Loki) or equivalent
  • Experience with compliance frameworks (HIPAA, SOC 2) and how they shape infrastructure decisions in regulated environments
  • Background supporting ML or research workflows moving to production: model deployment, pipeline orchestration, or similar
  • You've thought about what observability means for non-deterministic systems and have opinions about it

Responsibilities

  • Define infrastructure patterns for multi-agent systems that need to be observable, controllable, and recoverable in ways traditional apps don't require
  • Own and evolve our IaC stack: Terraform and Kubernetes across AWS, GCP, and Azure
  • Build observability primitives for agentic workflows, tracing agent decisions and execution paths, not just service latency and pod health
  • Design and maintain CI/CD pipelines that give teams fast, trustworthy feedback from commit to production
  • Build operational foundations: monitoring, alerting, incident response, and the new patterns that emerge when AI systems are participants in that response
  • Work across engineering teams to meet the reliability and compliance requirements of the institutions we serve (SOC 2, HIPAA, regulated environments in healthcare and energy)

What This Job Offers

Job Type

Full-time

Career Level

Mid Level

Education Level

No Education Listed

Number of Employees

1-10 employees
