VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval
Abstract
Generative multimodal large language models are adapted for video-text embedding and retrieval through intermediate-layer analysis and text-based alignment without visual supervision.
Recent studies have adapted generative Multimodal Large Language Models (MLLMs) into embedding extractors for vision tasks, typically through fine-tuning to produce universal representations. However, their performance on video remains inferior to Video Foundation Models (VFMs). In this paper, we focus on leveraging MLLMs for video-text embedding and retrieval. We first conduct a systematic layer-wise analysis, showing that intermediate (pre-trained) MLLM layers already encode substantial task-relevant information. Leveraging this insight, we demonstrate that combining intermediate-layer embeddings with a calibrated MLLM head yields strong zero-shot retrieval performance without any training. Building on these findings, we introduce a lightweight text-based alignment strategy that maps dense video captions to short summaries, enabling task-relevant video-text embedding learning without visual supervision. Remarkably, without any fine-tuning beyond text, our method outperforms existing approaches, often by a substantial margin, achieving state-of-the-art results across common video retrieval benchmarks.
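To make the first stage concrete, here is a minimal sketch under stated assumptions: it mean-pools the hidden states of one intermediate decoder layer of a Hugging Face-style MLLM and ranks videos against a text query by cosine similarity. The layer index, the pooling choice, and how the video/text inputs are packed are illustrative placeholders, not the paper's exact recipe.

```python
# Hypothetical sketch: intermediate-layer embeddings for zero-shot video-text retrieval.
# Assumes `model` is any Hugging Face-style MLLM that returns hidden states and
# `inputs` is a processor/tokenizer output dict containing an attention_mask.
import torch
import torch.nn.functional as F

@torch.no_grad()
def intermediate_embedding(model, inputs, layer_idx=-8):
    """Mean-pool one intermediate layer's hidden states over non-padded tokens."""
    out = model(**inputs, output_hidden_states=True)
    hidden = out.hidden_states[layer_idx]                 # (batch, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return F.normalize(pooled, dim=-1)

@torch.no_grad()
def rank_by_similarity(query_emb, video_embs):
    """Cosine similarity reduces to a dot product on L2-normalized embeddings."""
    scores = video_embs @ query_emb.squeeze(0)            # (num_videos,)
    return scores.argsort(descending=True)
```

In practice the layer would be chosen via the kind of layer-wise sweep the paper describes rather than fixed a priori; `layer_idx=-8` here is only an assumed default.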
Community
What if your multimodal LLM already contains strong video representations—strong enough to beat Video Foundation Models? 🤔
VidVec 🎥 : Unlocking Video MLLM Embeddings for Video-Text Retrieval
Key contributions (short):
✅ Layer-wise insight: intermediate MLLM layers already encode strong video–text representations
✅ Zero-shot 2-stage retrieval (no training): intermediate-layer embeddings + calibrated MLLM head for reranking (sketched after this list)
📈 Recall gains vs trained MLLM Embedders: +3.2% (MSR-VTT) · +7.7% (VATEX) · +9.4% (DiDeMo)
📝 Text-only “in-context” optimization: ~60K dense caption → short summary pairs (no visual supervision, loss sketched after this list)
🏆 SoTA video–text retrieval performance across MSR-VTT / MSVD / VATEX / DiDeMo
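For the second stage of the zero-shot pipeline, here is a hedged sketch: the shortlist produced by the embedding stage is rescored with the MLLM's language-model head by comparing its next-token probabilities for "Yes" versus "No" on a match question. The prompt wording, the `build_inputs` helper, and the simple Yes/No normalization are assumptions standing in for the paper's calibration procedure.

```python
# Illustrative reranking with the MLLM head (not the paper's exact prompt or calibration).
import torch

@torch.no_grad()
def rerank_with_mllm_head(model, tokenizer, build_inputs, query, shortlist):
    """Rerank shortlisted videos by how strongly the MLLM prefers 'Yes' over 'No'."""
    yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode("No", add_special_tokens=False)[0]
    scores = []
    for video in shortlist:
        # `build_inputs` is an assumed helper that packs the video tokens together with a
        # prompt such as: "Does this video match the description: <query>? Answer Yes or No."
        inputs = build_inputs(video, query)
        next_token_logits = model(**inputs).logits[0, -1]
        p_yes, p_no = torch.softmax(next_token_logits[[yes_id, no_id]], dim=-1)
        scores.append((p_yes - p_no).item())  # relative Yes/No normalization as a stand-in for calibration
    order = sorted(range(len(shortlist)), key=scores.__getitem__, reverse=True)
    return [shortlist[i] for i in order]
```

For the text-only alignment, one plausible instantiation (not necessarily the paper's exact objective) is a symmetric InfoNCE loss that pulls each dense caption's embedding toward the embedding of its short summary, trained purely on the ~60K text pairs with no video frames involved:

```python
# Assumed text-only alignment objective over (dense caption, short summary) embedding pairs.
import torch
import torch.nn.functional as F

def text_alignment_loss(dense_embs, summary_embs, temperature=0.05):
    """Symmetric InfoNCE over a batch of dense-caption / short-summary embeddings."""
    dense = F.normalize(dense_embs, dim=-1)
    summary = F.normalize(summary_embs, dim=-1)
    logits = dense @ summary.T / temperature              # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```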
Project page: https://iyttor.github.io/VidVec
arXiv: https://www.arxiv.org/abs/2602.08099
Code and weights are coming soon.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- PREGEN: Uncovering Latent Thoughts in Composed Video Retrieval (2026)
- VIRTUE: Versatile Video Retrieval Through Unified Embeddings (2026)
- X-Aligner: Composed Visual Retrieval without the Bells and Whistles (2026)
- TextME: Bridging Unseen Modalities Through Text Descriptions (2026)
- TARA: Simple and Efficient Time Aware Retrieval Adaptation of MLLMs for Video Understanding (2025)
- Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs (2026)
- LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation (2025)