AI Solutions Engineer

Hydra Host — Miami, FL
Remote

About The Position

About Hydra Host

Hydra Host is a Founders Fund–backed NVIDIA cloud partner building the infrastructure platform that powers AI at scale. We connect AI Factories — high-performance GPU data centers — with the teams that depend on them: research labs training foundation models, enterprises running production inference, and developer platforms demanding scalable compute capacity.

We operate where hardware meets software — the bare-metal layer where reliability, performance, and speed matter most. As AI workloads evolve faster than traditional cloud infrastructure can adapt, Hydra is building the foundation layer that makes it all possible.

The Role

As an AI Solutions Engineer, you’ll ensure our AI Platform and Enterprise customers have an exceptional technical experience from first deployment to scale. You’ll work at the intersection of customer enablement, infrastructure engineering, AI performance optimization, and developer enablement — helping teams build reliable, high-performance AI platforms on top of Hydra.

This role is about building the future before our customers ask for it. You’ll prototype and validate proof-of-concept neo-clouds on top of Hydra — standing up real AI platforms, running real workloads, and uncovering sharp edges before our AI Platform customers ever see them. These POCs exist to demonstrate customer value and pressure-test Hydra’s infrastructure, APIs, SDKs, and workflows. What you learn becomes product improvements, best practices, reference architectures, SDKs, templates, and default configurations across Hydra.

You’ll also serve as a technical face of Hydra to the developer ecosystem — showing what’s possible, how to do it, and why it matters.

Requirements

  • NVIDIA GPU Stack — Deep knowledge of the NVIDIA hardware and software stack (drivers, firmware, NVLink, NCCL, CUDA, and supporting libraries), and how stack compatibility impacts performance.
  • Bare Metal Linux — Strong experience with bare-metal Linux systems administration, driver stacks, and kernel tuning.
  • AI Workloads — Proficiency running workloads with Hugging Face, PyTorch, vLLM, and other model deployment frameworks, including large-scale inference and training.
  • AI Benchmarking — Hands-on experience benchmarking AI workloads (e.g., Megatron-LM).
  • Workload Orchestration — Experience running Kubernetes clusters (including Cluster API), Slurm, and Ansible for cluster automation and workload management.
  • Scripting — Solid scripting skills (e.g., shell, Python, Perl, Ruby).
  • Networking — OSI Layer 2/Layer 3 fundamentals (TCP/IP, DNS), VLANs, and bonding.
  • East/West Networking — Familiarity with RoCE or InfiniBand.
  • Observability and Monitoring — nvidia-smi profiling, plus Prometheus/Grafana or the ELK stack.
  • Container Runtimes — Experience with containers such as Docker, Podman, and Singularity.
  • Cloud Provisioning — Terraform, cloud-init, etc.

Nice To Haves

  • HPC Clusters — Experience in HPC or large distributed training environments.
  • TEE — Familiarity with Trusted Execution Environments (e.g., Intel TDX) and Confidential Computing.
  • Storage Systems — Familiarity with local and distributed storage: NVMe, RAID, NFS, and distributed file systems such as Ceph, WEKA, VAST, and DDN.
  • Bare-Metal Provisioning — MAAS, iPXE, IPMI.

Responsibilities

  • Prototype and operate proof-of-concept AI platforms and neo-clouds on top of Hydra using the Brokkr API to validate the developer experience.
  • Build and maintain an open-source “neo-cloud in a box” reference implementation that demonstrates multi-tenancy, spins servers up and down based on demand, and exposes containerized or virtualized GPU access.
  • Dogfood Hydra’s APIs, infrastructure, and tooling to continuously find gaps, sharp edges, and failure modes before customers do, working with product and engineering to resolve them.
  • Work closely with the API and monetization teams by incorporating direct customer feedback into feature prioritization, pricing models, and API design.
  • Run and validate the latest AI platforms, inference stacks, and orchestration frameworks on Hydra to ensure first-class support.
  • Collaborate closely with product and engineering to turn learnings into productized workflows, defaults, and automations.
  • Create targeted provisioning templates (e.g., self-managed Kubernetes, specialized inference engines, custom OS images) by researching common software stacks, licenses, and dependencies used by AI platforms.
  • Provide developers with high-quality technical enablement: code samples, SDK contributions, reference implementations, and clear documentation.
  • Act as a technical voice for Hydra’s developer ecosystem: host webinars, write technical content, run demos, participate in events, and support hackathons showcasing what’s possible on Hydra.
  • Document best practices and standardize configurations to scale customer success globally.

Benefits

  • Equity ownership — Meaningful stake in what we’re building together.
  • Competitive salary — We pay fairly and transparently.
  • Healthcare coverage — Medical, dental, vision for you and your family.
  • Fully remote team — Remote-first with hubs in Phoenix, Boulder, and Miami, plus periodic team offsites.
  • Direct impact — Your work will shape how thousands of GPU clusters are deployed and operated across the AI ecosystem.

What This Job Offers

  • Job Type — Full-time
  • Career Level — Mid Level
  • Education Level — No Education Listed
  • Number of Employees — 11-50 employees
