Nemotron ColEmbed V2
Model Overview
Description
The nvidia/nemotron-colembed-vl-4b-v2 is a state-of-the-art late-interaction embedding model that ranks No. 3 on ViDoRe V3, a comprehensive evaluation of retrieval for enterprise use cases (as of Jan 26, 2026), with an average score of 61.42 across 8 public tasks. The model was fine-tuned for query-document retrieval. Users input queries as text and documents as page images; the model outputs ColBERT-style multi-vector numerical representations for both queries and documents.
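As a rough illustration of the late-interaction idea, the sketch below implements ColBERT-style MaxSim scoring between one query and one document: every query token embedding is matched against its best document token embedding, and the maxima are summed into a single relevance score. The tensor sizes are made up (only the 2560 dimension matches the output spec below), and this is a conceptual sketch, not the model's internal code; in practice the released checkpoint exposes this scoring through its get_scores method (see the usage example further down).

```python
import torch

def late_interaction_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim scoring for a single query-document pair.

    query_emb: [num_query_tokens, dim]
    doc_emb:   [num_doc_tokens, dim]
    """
    # Similarity of every query token against every document token.
    sim = query_emb @ doc_emb.T          # [num_query_tokens, num_doc_tokens]
    # For each query token keep its best-matching document token, then sum.
    return sim.max(dim=-1).values.sum()

# Toy tensors: 12 query tokens and 300 page-image tokens, 2560-dim each.
q = torch.randn(12, 2560)
d = torch.randn(300, 2560)
print(late_interaction_score(q, d))
```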
✨ Key Improvements:
- ⚗️ Advanced Model Merging: Utilizes post-training model merging to combine the strengths of multiple fine-tuned checkpoints. This delivers the accuracy and stability of an ensemble without any additional inference latency.
- 🌍 Enhanced Synthetic Data: We significantly enriched our training mixture with diverse multilingual synthetic data, improving semantic alignment across languages and complex document types.
This model is for non-commercial/research use only.
License/Terms of Use
The use of this model is governed by the Creative Commons Attribution-NonCommercial 4.0 license, and the use of the post-processing scripts is licensed under Apache 2.0. Additional Information: Built with Qwen3-VL, which is released under Apache 2.0.
This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.
Deployment Geography
Global
Use Case
nemotron-colembed-vl-4b-v2 is intended for researchers exploring applications that must understand or retrieve information across both text and image modalities. It is instrumental in multimodal RAG systems, where queries are text and documents are page images containing text, charts, tables, or infographics. Potential applications include multimedia search engines, cross-modal retrieval systems, and conversational AI with rich input understanding.
Release Date
01/26/2026 via https://huggingface.co/nvidia/nemotron-colembed-vl-4b-v2
Model Architecture
- Architecture Type: Transformer
- Network Architecture: Qwen3-VL-4B-Instruct based encoder.
The nemotron-colembed-vl-4b-v2 is a transformer-based multimodal embedding model built from Qwen3-VL-4B-Instruct, which adopts a three-module architecture comprising a vision encoder (based on the SigLIP-2 architecture), an MLP-based vision–language merger, and a large language model (LLM); see the Qwen3-VL technical report for details. It has approximately 4.8B parameters.
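For a quick check of the parameter count, here is a small sketch; it assumes the same AutoModel load used in the Transformers Usage section below and is illustrative rather than part of the official tooling.

```python
import torch
from transformers import AutoModel

# Load the model as in the Transformers Usage section below.
model = AutoModel.from_pretrained(
    "nvidia/nemotron-colembed-vl-4b-v2",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

# Tally parameters; the card states approximately 4.8B.
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```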
Input(s):
Input Type(s): Image, Text
Input Format(s):
- Image: List of Red, Green, Blue (RGB) images
- Text: List of Strings
Input Parameters:
- Image: Two-Dimensional (2D)
- Text: One-Dimensional (1D)
Other Properties Related to Input:
- The maximum context length evaluated for this model is 10,240 tokens.
- Each image tile consumes 256 tokens. The model was tested extensively with the following settings in config.json: max_input_tiles = 8 and use_thumbnails = True, so that every image is split into at most 8 tiles plus 1 thumbnail (the whole image at lower resolution); see the token-budget sketch below. Images must be provided as Python PIL images. The model scales each image into multiple 512x512 tiles.
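As a back-of-the-envelope illustration of the image token budget implied by these settings (simple arithmetic, not code from the model repository):

```python
# Token budget per image under max_input_tiles=8, use_thumbnails=True,
# with 256 tokens per tile (an illustration, not the actual preprocessing code).
TOKENS_PER_TILE = 256
MAX_INPUT_TILES = 8
USE_THUMBNAIL = True

max_image_tokens = TOKENS_PER_TILE * (MAX_INPUT_TILES + (1 if USE_THUMBNAIL else 0))
print(max_image_tokens)          # 2304 tokens for a fully tiled image plus its thumbnail
print(10240 - max_image_tokens)  # 7936 tokens of the evaluated context left for the rest of the input
```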
Outputs
- Output Type: Floats
- Output Format: List of float arrays
- Output Parameters: List of float arrays with shape [batch_size x sequence_length x embedding_dim]
- Other Properties Related to Output: For each input token, the model outputs a 2560-dimensional embedding vector of floating-point values.
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Installation
The model requires transformers version 5.0.0rc0 and Flash Attention.

```bash
pip install transformers==5.0.0rc0
pip install flash-attn==2.6.3 --no-build-isolation
```

Depending on your environment, you might need to upgrade datasets, polars, and pydantic:

```bash
pip install -U datasets polars
pip install -U pydantic
```
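Optionally, you can confirm that the environment resolved the expected versions; this is just a quick check, not an additional requirement.

```python
import flash_attn
import transformers

print(transformers.__version__)  # expected: 5.0.0rc0
print(flash_attn.__version__)    # expected: 2.6.3
```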
Transformers Usage
```python
import torch
from transformers import AutoModel
from transformers.image_utils import load_image

# Load Model
model = AutoModel.from_pretrained(
    'nvidia/nemotron-colembed-vl-4b-v2',
    device_map='cuda',
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
).eval()

# Queries
queries = [
    'How is AI improving the intelligence and capabilities of robots?',
    'Canary, a multilingual model that transcribes speech in English, Spanish, German, and French with punctuation and capitalization.',
    'Generative AI can generate DNA sequences that can be translated into proteins for bioengineering.'
]

# Document page images
image_urls = [
    "https://developer.download.nvidia.com/images/isaac/nvidia-isaac-lab-1920x1080.jpg",
    "https://developer-blogs.nvidia.com/wp-content/uploads/2024/03/asr-nemo-canary-featured.jpg",
    "https://blogs.nvidia.com/wp-content/uploads/2023/02/genome-sequencing-helix.jpg"
]

# Load all images (load_image handles both local paths and URLs)
images = [load_image(img_path) for img_path in image_urls]

# Encoding
query_embeddings = model.forward_queries(queries, batch_size=8)
image_embeddings = model.forward_images(images, batch_size=8)

# Late-interaction scores between every query and every image
scores = model.get_scores(
    query_embeddings,
    image_embeddings
)

# Diagonal should have higher scores
print(scores)
# tensor([[21.5332, 21.1848, 20.9185],
#         [32.4948, 33.2485, 32.5982],
#         [26.0623, 26.1014, 26.5692]], device='cuda:0')
```
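To turn the score matrix into an actual retrieval result, you can rank the document images per query. The following is a minimal sketch reusing the queries, image_urls, and scores variables from the example above; the choice of k=2 is arbitrary.

```python
# Rank the document images for each query by late-interaction score.
top_scores, top_idx = scores.topk(k=2, dim=1)
for q_i, query in enumerate(queries):
    print(query)
    for score, doc_i in zip(top_scores[q_i].tolist(), top_idx[q_i].tolist()):
        print(f"  {score:.2f}  {image_urls[doc_i]}")
```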
Software Integration:
Runtime Engine(s): Not Applicable
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere - A100 40GB and A100 80GB
- NVIDIA Hopper - H100 80GB
Supported Operating System(s): Linux
Model Version(s)
nemotron-colembed-vl-4b-v2
Training and Evaluation Datasets
Training Dataset
The model was trained on publicly available datasets, including DocMatix-IR, VDR, Vidore-ColPali-Training, VisRAG-Ret-Train-Synthetic-data, VisRAG-Ret-Train-In-domain-data, and Wiki-SS-NQ.
Data Modality: Image
Image Training Data Size
- Less than a Million Images
Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
Labeling Method by dataset: Hybrid: Automated, Human, Synthetic
Properties: The vision embedding model was fine-tuned on approximately 500k image samples.
Evaluation Dataset
We evaluate the model on the ViDoRe V1, V2, and V3 Visual Document Retrieval benchmarks.
ViDoRe is a premier benchmark for Visual Document Retrieval, composed of various page-level retrieval tasks spanning multiple domains, languages, and settings. The latest version of the benchmark is ViDoRe V3, a comprehensive evaluation of retrieval for enterprise use cases.
We provide a script that uses the MTEB 2 library to evaluate ColEmbed models on the ViDoRe benchmarks.
- Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
- Labeling Method by dataset: Hybrid: Automated, Human, Synthetic
- Properties: More details on ViDoRe V1 and ViDoRe V2 can be found on the Visual Document Retrieval Benchmark leaderboard.
Evaluation Results
ViDoRe V1, V2, and V3 on MTEB leaderboards
pip install "mteb>=2.7.0, <3.0.0"
# Evaluates with Vidore V1 and V2
CUDA_VISIBLE_DEVICES=0; python3 mteb2_eval.py --model_name nvidia/nemotron-colembed-vl-4b-v2 --batch_size 16 --benchmark "VisualDocumentRetrieval"
# Evaluates with Vidore V3
CUDA_VISIBLE_DEVICES=0; python3 mteb2_eval.py --model_name nvidia/nemotron-colembed-vl-4b-v2 --batch_size 16 --benchmark "ViDoRe(v3)"
# Evaluates with a specific task/dataset of Vidore V3: Vidore3ComputerScienceRetrieval
CUDA_VISIBLE_DEVICES=0; python3 mteb2_eval.py --model_name nvidia/nemotron-colembed-vl-4b-v2 --batch_size 16 --benchmark "ViDoRe(v3)" --task-list Vidore3ComputerScienceRetrieval
In this section, we evaluate the performance of nemotron-colembed-vl-4b-v2 against other models that previously achieved top-five rankings on the leaderboards.
We report results on the ViDoRe benchmark suite. The tables below summarize the image-modality accuracy of nemotron-colembed-vl-4b-v2 on the ViDoRe V1, V2, and V3 benchmarks, alongside other NVIDIA nemotron-colembed models. Note that (M)MTEB leaderboards use Borda ranking. Each task acts like a voter that ranks models based on how well they perform. Models earn more points when they rank higher on a task. The model with the most total points across all tasks gets the top overall rank.
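To make the Borda procedure concrete, here is a toy sketch (the model names and scores below are made up, not leaderboard data): each task ranks the models by score, a model earns more points the higher it ranks on that task, and the points are summed across tasks.

```python
# Toy illustration of Borda ranking across tasks (made-up scores, not leaderboard data).
scores = {             # model -> per-task scores
    "model_a": [0.72, 0.61, 0.80],
    "model_b": [0.70, 0.65, 0.78],
    "model_c": [0.69, 0.60, 0.81],
}
num_tasks = len(next(iter(scores.values())))
points = {m: 0 for m in scores}
for t in range(num_tasks):
    # Each task "votes" by ranking the models; the best model gets the most points.
    ranked = sorted(scores, key=lambda m: scores[m][t], reverse=True)
    for rank, m in enumerate(ranked):
        points[m] += len(ranked) - 1 - rank
print(sorted(points.items(), key=lambda kv: kv[1], reverse=True))
# [('model_a', 4), ('model_b', 3), ('model_c', 2)]
```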
ViDoRe V3 (NDCG@10)
| Model | Avg | CompSci | Energy | FinanceEn | FinanceFr | HR | Industrial | Pharma | Physics |
|---|---|---|---|---|---|---|---|---|---|
| nemotron-colembed-8b | 63.54 | 79.30 | 69.82 | 67.29 | 51.54 | 66.32 | 56.03 | 67.19 | 50.84 |
| tomoro-colqwen3-8b | 61.60 | 75.35 | 68.41 | 65.08 | 49.10 | 63.98 | 54.41 | 66.36 | 50.13 |
| nemotron-colembed-4b | 61.42 | 78.56 | 67.48 | 65.02 | 49.01 | 62.39 | 53.91 | 66.10 | 48.86 |
| tomoro-colqwen3-4b | 60.16 | 75.44 | 66.43 | 63.84 | 46.83 | 60.09 | 53.58 | 65.74 | 49.32 |
| nemotron-colembed-3b-v2 | 59.70 | 77.09 | 64.88 | 64.23 | 44.41 | 62.28 | 51.71 | 66.04 | 46.93 |
| nomic-ai/colnomic-embed-multimodal-7b | 57.64 | 76.20 | 63.58 | 56.57 | 45.46 | 58.67 | 50.13 | 62.26 | 48.25 |
| jinaai/jina-embeddings-v4 | 57.54 | 71.81 | 63.50 | 59.30 | 46.10 | 59.53 | 50.38 | 63.09 | 46.63 |
ViDoRe V2 (NDCG@5)
| Model | Avg | BioMedicalLectures | ESGReportsHL | ESGReports | EconomicsReports |
|---|---|---|---|---|---|
| tomoro-colqwen3-8b | 65.40 | 65.47 | 75.98 | 60.71 | 59.46 |
| EvoQwen2.5-VL-Retriever-7B-v1 | 65.24 | 65.20 | 76.98 | 59.67 | 59.13 |
| nemotron-colembed-8b | 65.16 | 66.16 | 73.15 | 60.56 | 60.76 |
| tomoro-colqwen3-4b | 64.69 | 65.38 | 74.65 | 62.44 | 56.30 |
| nemotron-colembed-4b | 64.49 | 64.32 | 71.43 | 61.48 | 60.75 |
| nemotron-colembed-3b-v2 | 63.38 | 63.19 | 73.11 | 58.64 | 58.59 |
| nemotron-colembed-3b-v1 | 63.32 | 62.70 | 75.38 | 57.38 | 57.84 |
ViDoRe V1 (NDCG@5)
| Model | Avg | ArxivQA | DocVQA | InfoVQA | Shift | Syn-AI | Syn-Energy | Syn-Gov | Syn-Health | TabFQuAD | Tatdqa |
|---|---|---|---|---|---|---|---|---|---|---|---|
| nemotron-colembed-8b | 92.65 | 93.08 | 68.05 | 94.56 | 93.30 | 100.00 | 97.89 | 98.89 | 99.63 | 97.74 | 83.37 |
| nemotron-colembed-3b-v2 | 91.74 | 90.40 | 67.17 | 94.68 | 92.00 | 100.00 | 98.02 | 97.95 | 98.89 | 97.25 | 81.04 |
| nemotron-colembed-4b | 91.62 | 92.03 | 67.39 | 93.31 | 92.26 | 99.26 | 96.19 | 98.02 | 98.52 | 98.05 | 81.19 |
| nemotron-colembed-3b-v1 | 91.00 | 88.35 | 66.21 | 94.92 | 90.70 | 99.63 | 96.63 | 97.82 | 99.26 | 95.94 | 80.57 |
| tomoro-colqwen3-8b | 90.76 | 91.15 | 66.37 | 94.48 | 87.89 | 99.26 | 96.71 | 97.58 | 99.06 | 94.23 | 80.92 |
| EvoQwen2.5-VL-Retriever-7B-v1 | 90.68 | 91.49 | 65.07 | 94.11 | 88.80 | 99.63 | 96.63 | 96.29 | 98.89 | 93.63 | 82.26 |
| tomoro-colqwen3-4b | 90.57 | 90.58 | 66.30 | 94.31 | 87.39 | 99.26 | 96.91 | 97.17 | 99.63 | 94.33 | 79.87 |
Inference:
Acceleration Engine: Not Applicable
Test Hardware: A100 40GB, A100 80GB, H100 80GB
Citation
```bibtex
@misc{xu2025llamanemoretrievercolembedtopperforming,
title={Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model},
author={Mengyao Xu and Gabriel Moreira and Ronay Ak and Radek Osmulski and Yauhen Babakhin and Zhiding Yu and Benedikt Schifferer and Even Oldridge},
year={2025},
eprint={2507.05513},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.05513},
}
@misc{moreira2025nvretrieverimprovingtextembedding,
title={NV-Retriever: Improving text embedding models with effective hard-negative mining},
author={Gabriel de Souza P. Moreira and Radek Osmulski and Mengyao Xu and Ronay Ak and Benedikt Schifferer and Even Oldridge},
year={2025},
eprint={2407.15831},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2407.15831},
}
@article{Qwen3-VL,
title={Qwen3-VL Technical Report},
author={Shuai Bai and Yuxuan Cai and Ruizhe Chen and Keqin Chen and Xionghui Chen and Zesen Cheng and Lianghao Deng and Wei Ding and Chang Gao and Chunjiang Ge and Wenbin Ge and Zhifang Guo and Qidong Huang and Jie Huang and Fei Huang and Binyuan Hui and Shutong Jiang and Zhaohai Li and Mingsheng Li and Mei Li and Kaixin Li and Zicheng Lin and Junyang Lin and Xuejing Liu and Jiawei Liu and Chenglong Liu and Yang Liu and Dayiheng Liu and Shixuan Liu and Dunjie Lu and Ruilin Luo and Chenxu Lv and Rui Men and Lingchen Meng and Xuancheng Ren and Xingzhang Ren and Sibo Song and Yuchong Sun and Jun Tang and Jianhong Tu and Jianqiang Wan and Peng Wang and Pengfei Wang and Qiuyue Wang and Yuxuan Wang and Tianbao Xie and Yiheng Xu and Haiyang Xu and Jin Xu and Zhibo Yang and Mingkun Yang and Jianxin Yang and An Yang and Bowen Yu and Fei Zhang and Hang Zhang and Xi Zhang and Bo Zheng and Humen Zhong and Jingren Zhou and Fan Zhou and Jing Zhou and Yuanzhi Zhu and Ke Zhu},
journal={arXiv preprint arXiv:2511.21631},
year={2025}
}
```
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please make sure you have proper rights and permissions for all input image and video content; if image or video includes people, personal health information, or intellectual property, the image or video generated will not blur or maintain proportions of image subjects included.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.