NeoChainDaily
28.01.2026 • 05:45 Research & Innovation

Cross-User KV Sharing Cuts Cache Size to 0.8% in Sequential Recommendation Models


A new method named CollectiveKV has been introduced to address latency and storage concerns in sequential recommendation systems. The approach leverages shared key‑value (KV) representations across users, allowing the cache to be compressed to only 0.8% of its original size while preserving, and in some cases improving, model accuracy.

Background

Sequential recommendation models are integral to many online services, providing personalized item suggestions based on a user’s interaction history. These models increasingly rely on Transformer architectures, whose attention mechanisms improve predictive performance but also introduce computational complexity that scales with sequence length.
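
As a rough illustration of that scaling (not code from the paper), the sketch below runs causal self-attention over a single user's interaction history with synthetic embeddings and hypothetical dimensions; the L x L score matrix is what makes the cost grow quadratically with the history length L.

# Minimal sketch: causal self-attention over one user's interaction history.
# All shapes and data are illustrative assumptions, not taken from the paper.
import numpy as np

L, d = 200, 64                        # hypothetical history length and hidden size
rng = np.random.default_rng(0)
x = rng.normal(size=(L, d))           # item embeddings for the interaction history

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv      # queries, keys, values: each (L, d)

scores = q @ k.T / np.sqrt(d)         # (L, L) score matrix -> quadratic in L
mask = np.triu(np.ones((L, L), dtype=bool), k=1)
scores[mask] = -np.inf                # causal mask: attend only to past items
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v                     # (L, d) contextualised item representations
print(out.shape, scores.shape)        # (200, 64) (200, 200)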

KV Cache Challenges

To mitigate inference latency, practitioners have adopted KV cache techniques that store intermediate representations. However, the cache can become a storage bottleneck, especially for platforms with large user bases and long interaction histories, because each user’s cache must be retained separately.
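
A back-of-the-envelope sketch makes the bottleneck concrete. All numbers below (sequence length, model width, layer count, precision, user base) are illustrative assumptions, not figures from the paper.

# Hypothetical per-user KV cache footprint when every user's keys and values
# are stored separately for each Transformer layer.
def kv_cache_bytes(seq_len, d_model, n_layers, bytes_per_val=2):
    # two matrices (K and V) of shape (seq_len, d_model) per layer, fp16 values
    return 2 * seq_len * d_model * n_layers * bytes_per_val

per_user = kv_cache_bytes(seq_len=200, d_model=64, n_layers=2)   # ~100 KiB
n_users = 50_000_000                                             # assumed user base
print(f"per user: {per_user / 1024:.1f} KiB")
print(f"fleet total: {per_user * n_users / 1024**3:.1f} GiB")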

Cross-User Similarities

Analysis of KV matrices from multiple users revealed substantial overlap, suggesting that much of the stored information is not unique to any single user. Singular value decomposition (SVD) further showed that the majority of KV information is shareable, while only a small component remains user‑specific.
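
The synthetic sketch below mimics that kind of analysis under an assumed data model: each user's key matrix is a common component plus a small user-specific perturbation, and an SVD across users then concentrates almost all of the energy in the shared direction. Shapes, noise level, and data are invented for illustration.

# Illustrative only: SVD across users' key matrices, checking how much of the
# total energy a single shared direction captures when users largely overlap.
import numpy as np

rng = np.random.default_rng(0)
n_users, seq_len, d = 64, 50, 32
shared = rng.normal(size=(seq_len, d))                 # component common to all users
keys = np.stack([shared + 0.1 * rng.normal(size=(seq_len, d))
                 for _ in range(n_users)])             # per-user keys, mostly overlapping

flat = keys.reshape(n_users, seq_len * d)              # one flattened row per user
s = np.linalg.svd(flat, compute_uv=False)              # singular values across users
energy = np.cumsum(s**2) / np.sum(s**2)
print(f"top shared direction explains {energy[0]:.1%} of the energy; "
      f"the remainder is user-specific")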

CollectiveKV Mechanism

Building on these observations, CollectiveKV introduces a learnable global KV pool that captures the shared component. During inference, a user retrieves high‑dimensional shared KV from this pool and concatenates it with a low‑dimensional, user‑specific KV segment to reconstruct the full cache required for the model.
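
A minimal sketch of that shared-plus-specific reconstruction follows, assuming shapes, a pool size, and a nearest-neighbour retrieval rule that are not specified in the abstract; the authors' actual retrieval and training procedure may differ.

# Hedged sketch of the shared-plus-specific idea. Every name, shape, and the
# retrieval rule below are assumptions, not the authors' implementation.
import numpy as np

d_model, d_shared, d_user = 64, 60, 4        # d_shared + d_user = d_model
pool_size, seq_len = 256, 50

rng = np.random.default_rng(0)
global_pool = rng.normal(size=(pool_size, d_shared))   # learnable global KV pool
proj = rng.normal(size=(d_user, d_shared))             # hypothetical learned projection

def reconstruct_kv(user_specific_kv):
    """Retrieve high-dimensional shared KV rows from the global pool and
    concatenate them with the low-dimensional user-specific KV segment."""
    sims = (user_specific_kv @ proj) @ global_pool.T   # (seq_len, pool_size)
    shared_kv = global_pool[sims.argmax(axis=-1)]      # (seq_len, d_shared)
    return np.concatenate([shared_kv, user_specific_kv], axis=-1)

user_kv = rng.normal(size=(seq_len, d_user))  # only this small segment is kept per user
full_kv = reconstruct_kv(user_kv)
print(user_kv.shape, "->", full_kv.shape)     # (50, 4) -> (50, 64)

In this setup only the low-dimensional user segment would be stored per user, while the high-dimensional shared rows live once in the global pool and are reused by everyone.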

Experimental Validation

Tests conducted on five distinct sequential recommendation models across three public datasets demonstrated that the KV cache could be reduced to 0.8% of its original size. Despite the dramatic compression, model performance was maintained or modestly enhanced, confirming the efficacy of the shared‑plus‑specific KV structure.

Potential Impact

If adopted broadly, CollectiveKV could lower infrastructure costs for services that depend on real‑time recommendation, enable faster response times, and simplify scaling to larger user populations. Future research may explore extending the shared KV concept to other domains that employ Transformer‑based inference.

This report is based on the abstract of the research paper, published as an open-access preprint; the full text is available via arXiv.

End of Transmission

Original Source
