AI Software Engineer

Zoom · Seattle, WA
Hybrid

About The Position

The Team

You will join a dynamic AI Infrastructure team focused on enabling high-performance AI across Zoom's products and services. The team builds the core systems that support model training, deployment, and inference at scale, driving innovation in areas such as real-time communication, computer vision, and natural language understanding.

What You Can Expect

You'll design, implement, and own the inference systems that serve Zoom's AI models at production scale, across real-time communication, vision, and language workloads. You'll be hands-on with kernel-level optimisation, inference framework internals, and production serving infrastructure, working closely with research and platform teams to push the limits of latency, throughput, and cost.

Requirements

  • 5+ years of software engineering experience, with significant time spent on inference systems or ML infrastructure at production depth
  • Hands-on experience with at least one major inference framework: vLLM, TensorRT-LLM, SGLang, or ONNX Runtime (serving, not just export)
  • GPU programming experience: CUDA kernel development, memory optimisation, profiling with Nsight or equivalent
  • Production experience serving LLMs or large vision models: you've owned latency SLOs, debugged throughput regressions, and shipped optimisations that moved the needle
  • Depth in at least two of: speculative decoding, continuous batching, KV cache design, quantisation pipelines, prefill/decode disaggregation
  • Strong systems instincts in Python and C++; ability to read and modify framework internals
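To give a flavor of the framework-internals depth the requirements above describe, here is a minimal, illustrative sketch of continuous batching: finished sequences free their slot mid-flight and queued requests join the running batch at the next decode step, instead of waiting for the whole batch to drain as in static batching. All names (`Request`, `ContinuousBatcher`) are hypothetical and not from any specific framework.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """A toy generation request: an id and how many tokens remain to decode."""
    rid: int
    remaining_tokens: int


class ContinuousBatcher:
    """Toy continuous-batching scheduler.

    Real serving engines (vLLM, TensorRT-LLM, SGLang) add prefill scheduling,
    KV-cache block accounting, and preemption on top of this basic loop.
    """

    def __init__(self, max_batch_size: int):
        self.max_batch_size = max_batch_size
        self.waiting: deque[Request] = deque()
        self.running: list[Request] = []

    def submit(self, req: Request) -> None:
        self.waiting.append(req)

    def step(self) -> list[int]:
        # Admit waiting requests into any free slots before decoding.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        # One decode step for every running sequence.
        for req in self.running:
            req.remaining_tokens -= 1
        # Evict finished sequences, freeing slots for the next step.
        finished = [r.rid for r in self.running if r.remaining_tokens == 0]
        self.running = [r for r in self.running if r.remaining_tokens > 0]
        return finished
```

With `max_batch_size=2` and three requests queued, the third request joins the batch as soon as the first finishes, which is the throughput win over static batching.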

Nice To Haves

  • Experience with MoE models or 100B+ parameter deployments
  • Familiarity with disaggregated serving architectures or multi-node inference
  • Background in compiler-level optimisation (XLA, Triton, or similar)

Responsibilities

  • Design and build high-performance inference serving systems for large-scale transformer and multimodal models (including 100B+ and MoE architectures)
  • Implement and tune inference optimisations: speculative decoding, continuous batching, KV cache management, prefill/decode disaggregation, and quantisation (INT4/INT8/FP8)
  • Contribute to and customise inference frameworks (vLLM, TensorRT-LLM, SGLang, or equivalent) for Zoom's production requirements
  • Write and profile CUDA kernels and custom ops where framework-level optimisation is insufficient
  • Own end-to-end deployment: from model packaging and serving API design to latency SLO monitoring and incident response
  • Partner with research to translate model architecture changes into inference-efficient implementations
  • Drive technical design and set the bar for inference engineering practices across the team
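As a sketch of why the KV cache management mentioned above dominates serving cost, the footprint of a decoder's cache follows directly from the model shape: two tensors (K and V) per layer, each of shape [seq_len, num_kv_heads, head_dim]. The helper below is illustrative, not from any framework; the example numbers assume a hypothetical 7B-class config (32 layers, 32 KV heads, head dim 128, FP16).

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    dtype_bytes: int = 2,  # 2 for FP16/BF16, 1 for FP8/INT8
) -> int:
    """Bytes of KV cache for a batch of sequences at a given length.

    Factor of 2 covers the separate K and V tensors per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_bytes
```

At 4096 tokens this comes to 2 GiB per sequence in FP16, which is why paged/block allocation, grouped-query attention, and cache quantisation matter at production batch sizes.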

Benefits

  • Bonus
  • Equity


What This Job Offers

Job Type

Full-time

Career Level

Senior

Education Level

No Education Listed

Number of Employees

5,001-10,000 employees

© 2026 Teal Labs, Inc