AI inference is the process through which the knowledge stored in large AI models becomes accessible. When inference is offered as a service to a large user base, the pipeline must be performant: it must achieve low latency and high throughput while remaining scalable. The goal of this project is to explore and design novel KV-cache eviction policies that reduce latency and memory usage at large sequence lengths during inference. The project will target the vLLM inference framework. The strategies explored will be budget-constrained techniques based on characterizing the vector space of the tokens stored in the KV-cache. The key idea is to leverage the structure of the token embeddings to identify less important blocks to evict in a relevance- and diversity-aware manner, minimizing their impact on attention computations for incoming tokens (an illustrative sketch follows the requirements below). This approach improves upon methods like PagedEviction, which rely on crude heuristics that ignore inter-block information redundancy and diversity within the cache.

Education and Experience Requirements

The entirety of the appointment must be conducted within the United States.

Applicants must be:
- Currently enrolled in undergraduate or graduate studies at an accredited institution;
- A graduate of an accredited institution within the past 3 months; or
- Actively enrolled in a graduate program at an accredited institution.

In addition, applicants:
- Must be 18 years or older at the time the appointment begins.
- Must possess a cumulative GPA of at least 3.0 on a 4.0 scale.
- May, upon accepting an offer, be required to complete pre-employment drug testing, depending on appointment length. All students remain subject to applicable drug testing policies.
- Must complete a satisfactory background check.
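For context on the technical direction, below is a minimal sketch of the kind of relevance- and diversity-aware block scoring the project description refers to. It is illustrative only: the function name, the per-block summary vectors, and the greedy maximal-marginal-relevance (MMR) selection rule are assumptions made for this sketch, not vLLM's API or the project's final design.

```python
import numpy as np

def select_blocks_to_keep(block_keys, query, budget, lam=0.7):
    """Greedy MMR selection of KV-cache blocks under a fixed budget.

    block_keys: (n_blocks, d) per-block summary vectors (e.g., each block's
                mean key vector) -- an assumed representation for this sketch.
    query:      (d,) recent query vector that incoming tokens attend with.
    budget:     number of blocks the cache may retain.
    lam:        relevance/diversity trade-off in [0, 1].
    Returns indices of blocks to keep; all others are eviction candidates.
    """
    def unit(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    keys = unit(np.asarray(block_keys, dtype=float))
    q = unit(np.asarray(query, dtype=float))
    relevance = keys @ q                    # cosine similarity to the query
    kept, candidates = [], list(range(len(keys)))
    while candidates and len(kept) < budget:
        if kept:
            # Redundancy: each candidate's max similarity to any kept block.
            redundancy = (keys[candidates] @ keys[kept].T).max(axis=1)
        else:
            redundancy = np.zeros(len(candidates))
        # High score = similar to the query, dissimilar to blocks already kept.
        scores = lam * relevance[candidates] - (1.0 - lam) * redundancy
        kept.append(candidates.pop(int(np.argmax(scores))))
    return kept

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    blocks = rng.normal(size=(32, 64))      # 32 cached blocks, 64-dim summaries
    query = rng.normal(size=64)
    keep = select_blocks_to_keep(blocks, query, budget=8)
    evict = sorted(set(range(32)) - set(keep))
    print("keep:", keep, "evict:", evict)
```

Under a fixed block budget, this rule keeps blocks whose keys are similar to the incoming query (relevance) while penalizing blocks similar to those already kept (diversity); everything not selected becomes an eviction candidate, which is one way to account for the inter-block redundancy that cruder heuristics ignore.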
Job Type: Full-time
Career Level: Intern
Education Level: No education listed
Number of Employees: 1,001-5,000