Forward Deployed Validation Engineer, AI Infrastructure

Deploy Talent
New York, NY
Hybrid

About The Position

As a Forward Deployed Validation Engineer, you will be the domain expert who makes Atlas indispensable for our customers' datacenter and cluster validation workflows. You'll work at the intersection of system validation and cluster testing expertise, customer engagement, and product development: using Atlas to solve real problems for hardware validation teams at leading companies, then translating those workflows and insights back to our software and ML teams. Most validation engineers work inside discrete platforms, executing program by program. Here, your expertise will be the training signal that compounds Atlas’s intelligence for every customer. You'll own outcomes across the technical, product, and customer dimensions. If you've ever wanted your domain knowledge to scale beyond your direct work, this is how.

Requirements

  • Elite datacenter validation expertise - 4+ years with AI/ML datacenter infrastructure, GPU cluster validation, or large-scale hardware validation at leading hardware companies or cloud providers; you're the person hardware teams call to debug complex system issues.
  • Full-stack hardware debugging mastery - Deep understanding of GPU/CPU architecture, memory subsystems, BIOS/UEFI/BMC firmware, high-speed interconnects (PCIe/CXL/InfiniBand/RoCE), NVMe storage, and power/thermal management; experience validating systems from deployment through production at node and cluster scale; proven track record debugging issues across hardware, firmware, drivers, and software in distributed ML infrastructure.
  • Performance optimization at scale - Strong experience benchmarking and tuning GPU clusters at multiple scales (cluster/rack/node); expertise with profiling tools, GPU utilization patterns, memory bandwidth bottlenecks, interconnect performance, and distributed training efficiency.
  • Customer-facing technical leadership - You earn trust through technical credibility, understand workflows and pain points, communicate complex concepts clearly, and build strong relationships.
  • Automation & software engineering skills - Proficiency in Python, Bash, or similar for building validation frameworks and automating tests at scale; comfortable with APIs, CI/CD environments, and collaborating with software engineers to productize workflows.
  • Platform expertise - Experience with AMD and/or NVIDIA hardware and software stacks: AMD EPYC CPUs, Instinct GPUs, the ROCm software stack, or AMD networking technologies, and/or NVIDIA Grace CPUs, H100/B200/GB200 GPUs, the CUDA/cuDNN/NCCL/TensorRT software stack, and InfiniBand/NVLink networking technologies.
  • Willingness to travel domestically and internationally (30-40% of your time) to deploy with customers and validate hardware in the field.

Nice To Haves

  • Work in person at Arena Physica’s NYC headquarters when not deployed.

Responsibilities

  • Be the validation & performance expert - Execute datacenter validation and cluster performance testing across GPU/CPU/memory/BIOS/BMC/networking/storage subsystems; benchmark, profile, and optimize system and cluster performance; debug complex hardware/firmware/software interactions and drive root-cause analysis.
  • Deploy Atlas with customers - Embed at customer sites to validate datacenter hardware using Atlas as your primary tool, augmenting with your own expertise where needed. Build credibility through technical depth and results.
  • Codify and scale - Your value here isn’t just what you fix in the field; it’s what you teach Atlas. Establish validation methodologies for Atlas across common subsystems and testing phases (EVT, DVT, PVT). Alongside these, translate customer workflows and pain points into product requirements, and work closely with our engineering team to encode that expertise into Atlas. Every deployment should compound value for Atlas more broadly.

Benefits

  • 100% of the monthly premium for Aetna medical insurance, plus vision and dental coverage
  • 401(k) Retirement Plan
  • Unlimited PTO
  • Lunch every day from local restaurants via Sharebite
  • Relocation support provided