AI Infrastructure Engineer

Percepta • New York, NY

About The Position

We're hiring an AI Infrastructure Engineer to own the infrastructure, deployment, and operational reliability that powers Percepta's AI systems, including the autonomous agents at the core of what we ship. Part of the work is hardening what exists: tightening our Terraform footprint, strengthening deployment pipelines, bringing more rigor to how we manage infrastructure across regions and providers. Part of it is building what's missing. And part of it is genuinely new territory: figuring out what SRE means when the systems you're operating make autonomous decisions. The infrastructure patterns for the agentic systems of the future don't exist yet. You'll help define them.

Why this is different

You're deploying autonomous systems. The infrastructure contract changes when your workloads have agency. Observability means understanding why an agent made a decision, not just whether a pod is healthy.

The gap between research and production is real here. Our teams move optimization algorithms and AI systems from research environments into production, and you'll be part of that handoff. MLOps experience isn't required, but you'll be closer to that boundary than most infra roles.

Small team. Real ownership. You're making foundational decisions, not inheriting someone else's.

Requirements

5+ years building and operating production infrastructure in DevOps or SRE roles
The kind of engineer who sees a manual process and can't rest until it's automated well, not just scripted
Strong hands-on Terraform experience
Deep experience with at least one major cloud provider (AWS, GCP, or Azure): networking, IAM, cost management, the operational realities of production workloads
Solid Docker and Kubernetes experience in production. We run managed clusters across all three major clouds; this is a core part of the role
Experience designing and maintaining CI/CD pipelines (GitHub Actions, GitLab CI, or similar)
Scripting proficiency in Python, Bash, or similar
High agency: you don't wait for a ticket to fix what's broken, but you communicate, collaborate, and bring the team along
Genuine curiosity about AI systems, not just the infrastructure running them. You want to understand what you're operating
You find it interesting (not alarming) that some systems you'll operate will be making decisions on their own

Nice To Haves

Multi-region and multi-cloud experience across 2+ providers
Experience with single-tenant or on-prem deployments alongside multi-tenant SaaS
Familiarity with GitOps patterns and progressive delivery
Familiarity with the Grafana stack (Prometheus, Grafana, Loki) or equivalent
Experience with compliance frameworks (HIPAA, SOC 2) and how they shape infrastructure decisions in regulated environments
Background supporting ML or research workflows moving to production: model deployment, pipeline orchestration, or similar
You've thought about what observability means for non-deterministic systems and have opinions about it

Responsibilities

Define infrastructure patterns for multi-agent systems that need to be observable, controllable, and recoverable in ways traditional apps don't require
Own and evolve our IaC stack: Terraform and Kubernetes across AWS, GCP, and Azure
Build observability primitives for agentic workflows, tracing agent decisions and execution paths, not just service latency and pod health
Design and maintain CI/CD pipelines that give teams fast, trustworthy feedback from commit to production
Build operational foundations: monitoring, alerting, incident response, and the new patterns that emerge when AI systems are participants in that response
Work across engineering teams to meet the reliability and compliance requirements of the institutions we serve (SOC 2, HIPAA, regulated environments in healthcare and energy)
