About The Position

We are building next-generation AI infrastructure from the ground up. Our mission is to deliver highly performant, reliable, and scalable GPU clusters purpose-built for large-scale AI training and inference. As a startup, we operate with urgency, ownership, and a bias toward action. We are assembling the foundational infrastructure that will power frontier AI workloads, and we’re looking for engineers who want to build it from zero to scale.

Requirements

  • 5–8+ years in infrastructure engineering, hardware deployment, or data center operations.
  • Hands-on experience deploying GPU servers (HGX/DGX or similar platforms).
  • Experience with high-speed networking (InfiniBand, RoCE, Ethernet fabrics).
  • Strong Linux systems knowledge.
  • Experience troubleshooting distributed systems performance issues.
  • Comfortable working onsite in data center environments as needed.

Nice To Haves

  • Experience in AI/ML infrastructure or HPC environments.
  • Familiarity with NCCL, CUDA, RDMA.
  • Automation experience (Python, Ansible, Terraform, Bash).
  • Experience in high-density power and cooling environments.

Responsibilities

  • Execute end-to-end bringup of GPU nodes and racks from installation to production readiness.
  • Validate BIOS/BMC/firmware configurations and GPU health.
  • Perform rack-level integration including power, cabling, and airflow validation.
  • Bring up and validate high-speed network fabrics (InfiniBand, RoCE, 100–400G Ethernet).
  • Configure and validate leaf/spine network connectivity.
  • Run cluster-wide burn-in and stress testing.
  • Validate node-to-node performance (NCCL, RDMA, GPUDirect).
  • Troubleshoot hardware, firmware, and fabric-level issues.
  • Contribute to automation for provisioning and cluster validation.
  • Improve deployment playbooks and documentation.
  • Identify reliability issues early and drive corrective actions.
  • Help turn ad hoc deployments into repeatable systems.
  • Work closely with networking, systems software, and data center teams.
  • Coordinate with hardware vendors to resolve bringup issues.
  • Support rapid capacity expansion as we scale.
© 2026 Teal Labs, Inc