Head of Inference, Stealth Edge AI Co

Montauk Capital
New York, NY
Remote

About The Position

We are seeking a visionary, execution-oriented Head of Inference. As a senior, hands-on technical leader, you will be the technical authority on inference in the room: you'll define the inference architecture, make the foundational technical decisions, build the first proof of concept, and own this domain end to end alongside the CEO. You will own the core inference capability that drives the platform and customer experience, and you'll have a strong voice in the technical foundation of the company. You'll evolve the vision into a viable POC, then design and implement the distributed inference systems that grow out of it. Alongside the CEO, you'll represent the company with top-tier partners, early customers, and investors as its internal and external expert on inference. Beyond the CEO, you will have the support of a team of strong advisors and the initial founding team.

Requirements

  • You have a passion for inference and a background as a hands-on technical builder who has directly implemented inference systems before, ideally in production or near-production environments.
  • Deep knowledge of, and excitement about, model serving and the practical engineering required to make an inference system work on real hardware.
  • You can quickly translate a vision and initial concept into a viable POC, and you are comfortable making foundational technical decisions under ambiguity and building first-of-a-kind systems.
  • Production inference serving: vLLM, TensorRT-LLM, Triton Inference Server, or equivalent, distributed at scale
  • Quantization, SGLang, containerization, and cost-per-token optimization
  • Observability tooling: distributed tracing, latency profiling, and alerting; you can instrument and debug complex distributed systems, with a focus on building world-class observability and debuggability tooling
  • C++/CUDA/Rust
  • GPU utilization and CUDA kernel optimization: you have pushed hardware to its limits
  • Expertise in batching, KV-cache management, and speculative decoding
  • Experience scaling systems with Kubernetes, Ray, custom load balancing, and multi-GPU/multi-node inference
  • You have built a serving system that NVIDIA and cloud providers respect
  • Model deployment and serving
  • Systems engineering
  • Technical leadership experience, whether leading teams or owning outcomes
  • Startup / 0→1 DNA: You ship fast and communicate clearly

Responsibilities

  • Create the inference strategy and define the inference architecture for Edge AI
  • Own the inference serving layer end-to-end: vLLM, TensorRT-LLM, Triton, or equivalent
  • Build a credible POC fast, one that proves the platform works to NVIDIA, cloud providers, and customers
  • Drive cost-per-token optimization
  • Optimize GPU utilization, KV-cache management, and batching for production workloads
  • Own observability and reliability SLAs
  • Build distributed inference pipelines across multi-GPU, multi-node edge deployments
  • Set performance baselines and SLAs for inference latency and throughput
  • Define quantization strategy
  • Translate complex inference requirements into infrastructure designs
  • Define the software access layer architecture and oversee integration efforts
  • Engage credibly with investors, partners, and technical stakeholders; represent the company externally

Benefits

  • Competitive compensation + equity: True ownership over what you build