AI Software Engineer

Zoom · Seattle, WA
$151,800 - $332,200 · Hybrid

About The Position

The Team

You will join a dynamic AI Infrastructure team focused on enabling high-performance AI across Zoom’s products and services. The team builds the core systems that support model training, deployment, and inference at scale, driving innovation in areas such as real-time communication, computer vision, and natural language understanding.

What You Can Expect

You'll design, implement, and own the inference systems that serve Zoom's AI models at production scale, across real-time communication, vision, and language workloads. You'll be hands-on with kernel-level optimisation, inference framework internals, and production serving infrastructure, working closely with research and platform teams to push the boundary on latency, throughput, and cost.

Requirements

  • 5+ years of software engineering experience, with significant time spent on inference systems or ML infrastructure at production depth
  • Hands-on experience with at least one major inference framework: vLLM, TensorRT-LLM, SGLang, or ONNX Runtime (serving, not just export)
  • GPU programming experience: CUDA kernel development, memory optimisation, profiling with Nsight or equivalent
  • Production experience serving LLMs or large vision models: you've owned latency SLOs, debugged throughput regressions, and shipped optimisations that moved the needle
  • Depth in at least two of: speculative decoding, continuous batching, KV cache design, quantisation pipelines, prefill/decode disaggregation
  • Strong systems instincts in Python and C++; ability to read and modify framework internals

Nice To Haves

  • Experience with MoE models or 100B+ parameter deployments
  • Familiarity with disaggregated serving architectures or multi-node inference
  • Background in compiler-level optimisation (XLA, Triton, or similar)

Responsibilities

  • Design and build high-performance inference serving systems for large-scale transformer and multimodal models (including 100B+ and MoE architectures)
  • Implement and tune inference optimisations: speculative decoding, continuous batching, KV cache management, prefill/decode disaggregation, and quantisation (INT4/INT8/FP8)
  • Contribute to and customise inference frameworks (vLLM, TensorRT-LLM, SGLang, or equivalent) for Zoom's production requirements
  • Write and profile CUDA kernels and custom ops where framework-level optimisation is insufficient
  • Own end-to-end deployment: from model packaging and serving API design to latency SLO monitoring and incident response
  • Partner with research to translate model architecture changes into inference-efficient implementations
  • Drive technical design and set the bar for inference engineering practices across the team

Benefits

  • Bonus
  • Equity