Abstract
StateKV enables efficient long-video vision-language model inference by maintaining cross-frame context in a fixed-capacity recurrent state while using a full per-frame cache for decoding, achieving linear-time prefill with minimal accuracy loss compared to full self-attention.
Video vision-language models (VLMs) are increasingly used in long-horizon and streaming settings, yet most video encoders still rely on spatiotemporal self-attention, causing compute and latency to grow quadratically with the number of frames. Existing efficiency methods improve scalability but often lose accuracy relative to full self-attention, for example through aggressive frame/token dropping or coarse attention approximations. We introduce StateKV, an inference-time method that adapts pretrained long-video VLMs to linear-time video prefill by carrying cross-frame context in a fixed-capacity, importance-based recurrent state, paired with a second full per-frame cache used for decoding. Across three long-video benchmarks and seven models spanning three families and multiple scales, StateKV remains close to full self-attention and consistently outperforms dominant sliding-window / recency-based streaming approximations, without fine-tuning or architectural changes. StateKV also reduces video-prefill cost, measured in FLOPs, enabling stronger accuracy at a fixed compute budget by running larger models. These results suggest a practical step toward scalable long-video understanding.
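To make the idea of a fixed-capacity, importance-based recurrent state concrete, the following is a minimal single-head NumPy sketch. It is not the authors' implementation: the abstract does not specify how importance is scored, so this sketch assumes importance is the accumulated attention mass each cached entry receives. The names `attend`, `prefill_with_state`, and `capacity` are hypothetical.

```python
import numpy as np

def attend(q, k, v):
    """Scaled dot-product attention; returns outputs and attention weights."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v, w

def prefill_with_state(frames_q, frames_k, frames_v, capacity):
    """Process frames one at a time, carrying a bounded KV state across frames.

    ASSUMPTION: importance = accumulated attention mass received by each entry.
    The paper's actual scoring rule may differ.
    """
    d = frames_k[0].shape[-1]
    state_k = np.empty((0, d))
    state_v = np.empty((0, d))
    state_imp = np.empty(0)
    full_cache = []  # second cache: full per-frame KV, kept for decoding
    outputs = []
    for q, k, v in zip(frames_q, frames_k, frames_v):
        # The current frame attends to the carried state plus its own tokens,
        # so per-frame cost depends on capacity, not on the number of past frames.
        ctx_k = np.concatenate([state_k, k])
        ctx_v = np.concatenate([state_v, v])
        out, w = attend(q, ctx_k, ctx_v)
        outputs.append(out)
        full_cache.append((k, v))
        # Update importance, then keep the top-`capacity` entries in original order.
        imp = np.concatenate([state_imp, np.zeros(len(k))]) + w.sum(axis=0)
        keep = np.sort(np.argsort(-imp)[:capacity])
        state_k, state_v, state_imp = ctx_k[keep], ctx_v[keep], imp[keep]
    return outputs, (state_k, state_v), full_cache
```

Under these assumptions each frame attends to at most `capacity` carried entries plus its own tokens, so total prefill work grows linearly with the number of frames, whereas full spatiotemporal self-attention grows quadratically. The sketch omits multi-head attention, causal masking within a frame, and how the full per-frame cache is consumed during decoding.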
Community
StateKV is an inference-time method that enables linear-time video prefill for video vision-language models by using a fixed-capacity recurrent state, improving efficiency for long video understanding.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion (2026)
- OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation (2026)
- CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding (2026)
- KVCapsule: Efficient Sequential KV Cache Compression for Vision-Language Models with Asymmetric Redundancy (2026)
- Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs (2026)
- Mosaic: Cross-Modal Clustering for Efficient Video Understanding (2026)
- Reducing Peak Memory Usage for Modern Multimodal Large Language Model Pipelines (2026)