AI Software Engineer

Zoom•Seattle, WA

2d•$151,800 - $332,200•Hybrid

About The Position

The AI Infrastructure team at Zoom enables high-performance AI across Zoom’s products and services. The team builds the core systems that support model training, deployment, and inference at scale, driving innovation in areas such as real-time communication, computer vision, and natural language understanding. The role involves designing, implementing, and owning the inference systems that serve Zoom’s AI models at production scale across these workloads. The work is hands-on, spanning kernel-level optimization, inference framework internals, and production serving infrastructure, and includes collaborating with research and platform teams to optimize latency, throughput, and cost.

Requirements

A Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related technical field
5+ years of software engineering experience, with significant time spent on inference systems or ML infrastructure at production depth
Hands-on experience with at least one major inference framework: vLLM, TensorRT-LLM, SGLang, or ONNX Runtime (serving, not just export)
GPU programming experience: CUDA kernel development, memory optimization, profiling with Nsight or equivalent
Production experience serving LLMs or large vision models: you've owned latency SLOs, debugged throughput regressions, and shipped optimizations that moved the needle
Depth in at least two of: speculative decoding, continuous batching, KV cache design, quantization pipelines, prefill/decode disaggregation
Strong systems instincts in Python and C++; ability to read and modify framework internals

Nice To Haves

Advanced degrees (Master’s or PhD) are advantageous
Experience with MoE models or 100B+ parameter deployments
Familiarity with disaggregated serving architectures or multi-node inference
Background in compiler-level optimization (XLA, Triton, or similar)

Responsibilities

Design and build high-performance inference serving systems for large-scale transformer and multimodal models (including 100B+ and MoE architectures)
Implement and tune inference optimizations: speculative decoding, continuous batching, KV cache management, prefill/decode disaggregation, and quantization (INT4/INT8/FP8)
Contribute to and customize inference frameworks (vLLM, TensorRT-LLM, SGLang, or equivalent) for Zoom's production requirements
Write and profile CUDA kernels and custom ops where framework-level optimization is insufficient
Own end-to-end deployment: from model packaging and serving API design to latency SLO monitoring and incident response
Partner with research to translate model architecture changes into inference-efficient implementations
Drive technical design and set the bar for inference engineering practices across the team

Benefits

Award-winning workplace culture
Variety of perks, benefits, and options to help employees maintain their physical, mental, emotional, and financial health; support work-life balance; and contribute to their community in meaningful ways
