This role involves leading optimization of the full inference pipeline for large models (LLMs and multimodal models), focusing on KV Cache storage strategies, Router architecture design, and coordinated operator optimization to maximize throughput and minimize latency. It also includes in-depth research into heterogeneous computing, evaluating how well hardware accelerators suit different inference scenarios, and developing standardized optimization schemes. The engineer will design and implement high-performance inference frameworks, optimizing scheduling and memory management to address the challenges of distributed inference. A key aspect is tracking global advances in inference technology and driving their productization, while also providing technical leadership: overcoming bottlenecks, designing technical roadmaps, and mentoring team members.
Job Type: Full-time
Career Level: Senior
Education Level: Ph.D. or professional degree
Number of Employees: 5,001-10,000 employees