Machine Learning Infrastructure Engineer

UniversalAGI

Onsite

About The Position

UniversalAGI is hiring an Infrastructure Engineer to build and own the execution platform powering our research and customer deployments: data generation + simulation orchestration + training/fine-tuning infrastructure + benchmarking pipelines + production deployments in customer environments. You’ll work closely with the CEO and founding team to turn research into repeatable, scalable, reliable systems - internally and in customer infrastructure. This is a “ship outcomes” role: your work directly determines how fast we can iterate, how reproducible our results are, and how reliably we deliver in production.

Requirements

Strong software engineering skills (clean code, debugging, reliability, reproducibility).
Hands-on experience building/operating infrastructure for ML/compute-heavy workflows: pipelines, job orchestration, GPU compute, storage, CI/CD, monitoring.
Olympic athlete mindset: You have high standards for yourself and are obsessed with measurable improvement on the metrics you are delivering to customers.
Resourcefulness: You know when to do the “quick & correct” fix vs. when to invest in a robust solution, and you can justify the tradeoff in terms of impact.
Ownership: Comfortable owning work end-to-end and being accountable for measurable outcomes.

Nice To Haves

Experience with workflow orchestration (e.g., Ray, Kubernetes, Slurm).
Experience with GPU infrastructure and distributed training systems.
Experience building evaluation/benchmarking frameworks with strong reproducibility guarantees.
Experience deploying into regulated / security-sensitive environments (gov/defense/enterprise).
Experience with simulation/HPC pipelines (CFD, meshing, batch workloads) is a plus but not required.
Experience in an FDE-style / delivery execution role (or similar “ship results fast” environments).

Responsibilities

Build the foundation platform (internal)
Build and operate scalable infrastructure for data generation and simulation workflows (job orchestration, scheduling, queues, retries, observability).
Build reproducible pipelines for training/fine-tuning and benchmarking (artifact/version management, experiment tracking, dataset lineage).
Own cost/performance tradeoffs across compute, storage, networking, and runtime efficiency.
Deploy to customers (external)
Lead deployments of our stack into customer cloud/on-prem environments, including secure networking, permissions, and data movement.
Build robust deployment patterns: environment provisioning, CI/CD, rollbacks, monitoring, and incident response.
Partner with customers to ensure reliability and repeatability under real-world constraints (security, compliance, infra limits, data governance).