About The Position

At d-Matrix, we are focused on unleashing the potential of generative AI to power the transformation of technology. We are at the forefront of software and hardware innovation, pushing the boundaries of what is possible. Our culture is one of respect and collaboration. We value humility and believe in direct communication. Our team is inclusive, and our differing perspectives allow for better solutions. We are seeking individuals who are passionate about tackling challenges and driven by execution. Ready to come find your playground? Together, we can help shape the endless possibilities of AI.

We are seeking a motivated and innovative Machine Learning Intern to join our team. The intern will develop a dynamic Key-Value (KV) cache solution for Large Language Model (LLM) inference, aimed at improving memory utilization and execution efficiency on d-Matrix hardware. The project involves modeling the cache at the PyTorch graph level to enable efficient, torch-native KV-Cache support, addressing limitations in current solutions.
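
For context, a minimal sketch of the conventional cache pattern this project targets (illustrative only; the function name, shapes, and sizes are assumptions, not d-Matrix code). Keys and values are concatenated along the sequence axis on every decoding step, so the cache tensors change shape each iteration, which complicates graph capture and export:

    import torch

    def update_kv_cache(past_key, past_value, new_key, new_value):
        # Conventional list-of-tensors style cache: concatenate the current
        # step's key/value states onto the running cache along the sequence
        # axis. Shapes are (batch, num_heads, seq_len, head_dim), and seq_len
        # grows by one on every decoding step, so shapes are never static.
        if past_key is None:
            return new_key, new_value
        key = torch.cat([past_key, new_key], dim=2)
        value = torch.cat([past_value, new_value], dim=2)
        return key, value

    # Two decoding steps with batch=1, 8 heads, head_dim=64.
    k_step = torch.randn(1, 8, 1, 64)
    v_step = torch.randn(1, 8, 1, 64)
    k_cache, v_cache = update_kv_cache(None, None, k_step, v_step)
    k_cache, v_cache = update_kv_cache(k_cache, v_cache, k_step, v_step)
    print(k_cache.shape)  # torch.Size([1, 8, 2, 64])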

Requirements

  • Currently pursuing a degree in Computer Science, Electrical Engineering, Machine Learning, or a related field.
  • Familiarity with PyTorch and deep learning concepts, particularly regarding model optimization and memory management.
  • Understanding of hardware-accelerated computation; experience with CUDA programming is a plus.
  • Strong programming skills in Python, with experience in PyTorch.
  • Analytical mindset with the ability to approach problems creatively.

Nice To Haves

  • Experience with deep learning model inference optimization.
  • Knowledge of data structures used in machine learning for memory and compute efficiency.
  • Experience with hardware-specific optimization, especially on custom accelerators such as d-Matrix hardware.

Responsibilities

  • Research and analyze existing KV-Cache implementations used in LLM inference, particularly those that store past key/value states as lists of PyTorch tensors (past_key_values).
  • Investigate “Paged Attention” mechanisms that leverage dedicated CUDA data structures to optimize memory for variable sequence lengths.
  • Design and implement a torch-native dynamic KV-Cache model that can be integrated seamlessly within PyTorch (a minimal illustrative sketch follows this list).
  • Model KV-Cache behavior within the PyTorch compute graph to improve compatibility with torch.compile and facilitate the export of the compute graph.
  • Conduct experiments to evaluate memory utilization and inference efficiency on d-Matrix hardware.
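
As one possible starting point for the torch-native design above, here is a minimal sketch under stated assumptions (the class name StaticKVCache, the max_seq_len budget, and the shapes are illustrative, not an existing d-Matrix or PyTorch API). The cache is a pre-allocated buffer updated in place, so tensor shapes stay fixed across decoding steps and the cache remains part of the module state visible to torch.compile and graph export:

    import torch

    class StaticKVCache(torch.nn.Module):
        # Illustrative pre-allocated KV cache for a single attention layer.
        def __init__(self, batch, num_heads, max_seq_len, head_dim,
                     dtype=torch.float32):
            super().__init__()
            shape = (batch, num_heads, max_seq_len, head_dim)
            # Buffers keep the cache inside the module state and the traced graph.
            self.register_buffer("k_cache", torch.zeros(shape, dtype=dtype))
            self.register_buffer("v_cache", torch.zeros(shape, dtype=dtype))

        def forward(self, new_key, new_value, position):
            # Write this step's key/value states at `position` in place;
            # no reallocation, so every step sees the same static shapes.
            idx = torch.arange(position, position + new_key.shape[2],
                               device=new_key.device)
            self.k_cache.index_copy_(2, idx, new_key)
            self.v_cache.index_copy_(2, idx, new_value)
            return self.k_cache, self.v_cache

    # Usage: one decode step at position 5 with batch=1, 8 heads, head_dim=64,
    # and a budget of 128 cached tokens.
    cache = StaticKVCache(1, 8, 128, 64)
    k, v = cache(torch.randn(1, 8, 1, 64), torch.randn(1, 8, 1, 64), position=5)
    print(k.shape)  # torch.Size([1, 8, 128, 64]); the shape never changes

A natural follow-up experiment is to wrap a decoding loop around such a module under torch.compile and compare memory use and throughput against the concatenation-based cache sketched earlier.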