🧠 Thinking with Visual Primitives (3B Proof-of-Concept)

This repository provides the inference code and LoRA adapter weights for a 3B-parameter proof-of-concept replication of the "Thinking with Visual Primitives" paradigm introduced by DeepSeek-AI.

While standard Multimodal LLMs have largely solved the Perception Gap through high-resolution cropping, they still suffer from the Reference Gap: the inherent inability of natural language to serve as a precise, unambiguous pointer within a continuous visual space. This model elevates spatial markers, specifically bounding boxes, to "minimal units of thought". By interleaving these visual primitives directly into its Chain-of-Thought (CoT), the model can "point" while it "reasons", anchoring abstract linguistic thoughts onto concrete spatial coordinates.

Note: This is an independent, open-weight 3B proof-of-concept designed to demonstrate the architectural viability of visual primitive grounding. The original paper uses a proprietary 284B-A13B MoE architecture. Training code is not included in this release.

📊 Training Dataset

This model was trained exclusively on the COCO Object Detection dataset (detection-datasets/coco):

  • SFT Phase: 50,000 samples from the COCO train split, filtered using a Visual-Geometric Quality Review to remove Mega Boxes (>90% area) and tiny ambiguous boxes (<1% area).
  • GRPO (RL) Phase: 5,000 samples from the COCO validation split, filtered for "Normal-Level" difficulty (2–10 objects per image, target object occupying 5–60% of the image area) to ensure non-trivial RL learning signals.
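
The two filters above can be sketched as simple predicates on box area. This is a minimal illustration, not the project's training code (which is not released): the `Box` type, the function names, and the assumption that boxes are given as pixel width and height are all hypothetical; only the thresholds come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical box record: width and height in pixels."""
    w: float
    h: float


def area_fraction(box: Box, img_w: int, img_h: int) -> float:
    """Fraction of the image area covered by the box."""
    return (box.w * box.h) / (img_w * img_h)


def keep_for_sft(box: Box, img_w: int, img_h: int) -> bool:
    """SFT filter: drop Mega Boxes (>90% area) and tiny boxes (<1% area)."""
    frac = area_fraction(box, img_w, img_h)
    return 0.01 <= frac <= 0.90


def keep_for_grpo(target: Box, num_objects: int, img_w: int, img_h: int) -> bool:
    """GRPO filter: 2-10 objects per image, target covering 5-60% of the image."""
    frac = area_fraction(target, img_w, img_h)
    return 2 <= num_objects <= 10 and 0.05 <= frac <= 0.60
```

For a 640x480 image, a 200x200 target in a scene with 4 objects covers about 13% of the area and would pass both filters, while a 20x20 box would be rejected as tiny.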

πŸ—οΈ Architecture

The model uses a lightweight vision-language pipeline that mirrors the paper's token-efficiency philosophy:

  • Vision Encoder: google/siglip-so400m-patch14-384
  • Spatial Compressor: A 3x3 Average Pooling layer that compresses adjacent patch tokens to maximize visual token efficiency.
  • Projector: A 2-layer MLP (GELU) bridging the vision and text embedding spaces.
  • Language Backbone: Qwen/Qwen2.5-3B
  • Vocabulary Extension: Added special tokens <ref>, </ref>, <box>, </box>, <point>, </point> to natively support visual primitive generation.
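
To make the token budget concrete, the sketch below works through the arithmetic of the spatial compressor. It assumes the encoder emits a 27x27 patch grid for a 384-pixel input with 14-pixel patches, and that the 3x3 pooling uses non-overlapping windows (stride 3); the stride is an assumption, since the description above does not state it. The function names are illustrative, and the pooling is written in plain Python for clarity rather than as the repository's actual layer.

```python
def pooled_token_count(image_size=384, patch_size=14, pool=3):
    """Return (tokens before pooling, tokens after pooling) for a square image."""
    side = image_size // patch_size      # 384 // 14 = 27 patches per side
    pooled_side = side // pool           # 27 // 3 = 9 pooled cells per side
    return side * side, pooled_side * pooled_side


def avg_pool_grid(grid, pool=3):
    """Average non-overlapping pool x pool windows of a 2D grid of token vectors.

    grid[r][c] is a list of floats (one patch embedding).
    Rows and columns that do not fill a complete window are dropped.
    """
    rows, cols = len(grid), len(grid[0])
    dim = len(grid[0][0])
    pooled = []
    for r in range(0, rows - rows % pool, pool):
        pooled_row = []
        for c in range(0, cols - cols % pool, pool):
            window = [grid[r + i][c + j] for i in range(pool) for j in range(pool)]
            pooled_row.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
        pooled.append(pooled_row)
    return pooled


print(pooled_token_count())  # (729, 81)
```

Under these assumptions each image costs 81 visual tokens instead of 729, a 9x reduction in the sequence length the language backbone has to attend over.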

Output Format:

1. **Intent Analysis**: The user wants me to locate the [object].
2. **Visual Grounding**: Scanning the scene... <ref>object</ref><box>[[x1,y1,x2,y2]]</box>
3. **Conclusion**: Coordinates anchored.

(Coordinates are normalized to a discrete 0-999 grid relative to the dimensions of the square-padded image, not the original image.)
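
As an illustration of this normalization, the sketch below maps a pixel-space box on the original image to the 0-999 grid and formats it as a primitive string. It assumes the same centered square padding used in the Quick Start code and rounds to the nearest grid cell; the rounding rule and the `encode_box` helper are assumptions, not part of the released code.

```python
def encode_box(label, box, orig_w, orig_h, grid_max=999):
    """Map a pixel box (x1, y1, x2, y2) on the original image to a primitive string."""
    max_dim = max(orig_w, orig_h)
    pad_x = (max_dim - orig_w) // 2   # horizontal offset added by centered padding
    pad_y = (max_dim - orig_h) // 2   # vertical offset added by centered padding

    def to_grid(value, pad):
        # Shift into padded-image coordinates, scale to the grid, then clamp.
        cell = round((value + pad) / max_dim * grid_max)
        return max(0, min(grid_max, cell))

    x1, y1, x2, y2 = box
    coords = [to_grid(x1, pad_x), to_grid(y1, pad_y), to_grid(x2, pad_x), to_grid(y2, pad_y)]
    return f"<ref>{label}</ref><box>[[{','.join(map(str, coords))}]]</box>"


# A box covering a whole 640x480 image: the padded square adds 80 px above and below.
print(encode_box("person", (0, 0, 640, 480), 640, 480))
# <ref>person</ref><box>[[0,125,999,874]]</box>
```

Note that a full-frame box on a non-square image does not map to `[[0,0,999,999]]`, because the padding bands are part of the coordinate space.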


🎯 Example Inference

Prompt: "Locate the person in this image."

Model Output:

  1. Intent Analysis: The user wants me to locate the person in the image.
  2. Visual Grounding: Scanning the scene, I have identified the target entity. <ref>person</ref><box>[[327,162,625,825]]</box>
  3. Conclusion: The spatial coordinates have been successfully anchored.

[Image: Grounding Example]
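
A response in this format can be turned into structured data with a small parser. The sketch below assumes the output follows the Output Format shown earlier, with each label wrapped in `<ref>` tags and each box in `<box>` tags; `parse_primitives` is a hypothetical helper, and the check that discards degenerate boxes is an added safeguard rather than documented model behavior.

```python
import re

# One <ref>label</ref> immediately followed by one <box>[[x1,y1,x2,y2]]</box>.
PRIMITIVE_RE = re.compile(
    r"<ref>(.*?)</ref>\s*<box>\[\[(\d{1,3}),\s*(\d{1,3}),\s*(\d{1,3}),\s*(\d{1,3})\]\]</box>"
)


def parse_primitives(text):
    """Return a list of (label, (x1, y1, x2, y2)) pairs found in the model output."""
    results = []
    for label, *coords in PRIMITIVE_RE.findall(text):
        x1, y1, x2, y2 = map(int, coords)
        # Skip boxes with zero or negative extent.
        if x1 < x2 and y1 < y2:
            results.append((label.strip(), (x1, y1, x2, y2)))
    return results


sample = (
    "2. Visual Grounding: Scanning the scene, I have identified the target entity. "
    "<ref>person</ref><box>[[327,162,625,825]]</box>"
)
print(parse_primitives(sample))  # [('person', (327, 162, 625, 825))]
```

Keeping the label alongside the box makes it straightforward to handle responses that ground more than one object.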


🚀 Quick Start (Inference)

Ensure you have the required dependencies installed:

pip install torch transformers peft accelerate Pillow

Command Line Usage

python inference.py --image "path/to/image.jpg" --target "person"

Python API Usage

import torch
import re
from PIL import Image, ImageDraw, ImageFont
from transformers import AutoProcessor, AutoTokenizer
from peft import PeftModel
from model import VisualPrimitiveModel

# 1. Load Tokenizer and Processor
llm_id = "Qwen/Qwen2.5-3B"
vit_id = "google/siglip-so400m-patch14-384"
adapter_id = "MeowML/Visual-Primitives-Qwen2.5-3B"

tokenizer = AutoTokenizer.from_pretrained(llm_id)
tokenizer.add_tokens(["<ref>", "</ref>", "<box>", "</box>", "<point>", "</point>"])
processor = AutoProcessor.from_pretrained(vit_id)

# 2. Load Base Model and LoRA Adapter
base_model = VisualPrimitiveModel(llm_path=llm_id, vit_path=vit_id)
base_model.llm.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(base_model, adapter_id)
model = model.to("cuda", dtype=torch.float16).eval()

# 3. Prepare Image and Prompt (with square padding)
image = Image.open("test_image.jpg").convert("RGB")
orig_w, orig_h = image.size
max_dim = max(orig_w, orig_h)
padded_image = Image.new("RGB", (max_dim, max_dim), (255, 255, 255))
pad_x = (max_dim - orig_w) // 2
pad_y = (max_dim - orig_h) // 2
padded_image.paste(image, (pad_x, pad_y))

prompt = "Locate the main subject in this image and output its spatial location using the format: <ref>object_name</ref><box>[[x1,y1,x2,y2]]</box>."
chat_prompt = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"

inputs = tokenizer(chat_prompt, return_tensors="pt").to("cuda")
pixel_values = processor(images=padded_image, return_tensors="pt")["pixel_values"].to("cuda", dtype=torch.float16)

# 4. Generate
outputs = model.generate(**inputs, pixel_values=pixel_values, max_new_tokens=150, do_sample=False)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
print(generated_text)

# 5. Reverse Padding Math to get original coordinates
matches = re.findall(r'<box>\[\[(\d{1,3}),\s*(\d{1,3}),\s*(\d{1,3}),\s*(\d{1,3})\]\]</box>', generated_text)
if matches:
    x1, y1, x2, y2 = map(int, matches[-1])
    abs_x1 = max(0, min(int(((x1 / 999) * max_dim) - pad_x), orig_w))
    abs_y1 = max(0, min(int(((y1 / 999) * max_dim) - pad_y), orig_h))
    abs_x2 = max(0, min(int(((x2 / 999) * max_dim) - pad_x), orig_w))
    abs_y2 = max(0, min(int(((y2 / 999) * max_dim) - pad_y), orig_h))
    print(f"Original Coordinates: [{abs_x1}, {abs_y1}, {abs_x2}, {abs_y2}]")
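
The reverse-padding arithmetic in step 5 can be checked by hand with a worked example. The sketch below wraps the same math in a hypothetical `grid_to_pixels` helper and applies it to the sample box from the Example Inference section, assuming a 640x480 landscape image (the actual size of the example image is not stated).

```python
def grid_to_pixels(box, orig_w, orig_h, grid_max=999):
    """Convert a 0-999 grid box on the square-padded image to original-image pixels."""
    max_dim = max(orig_w, orig_h)
    pad_x = (max_dim - orig_w) // 2
    pad_y = (max_dim - orig_h) // 2

    def unpad(value, pad, limit):
        # Scale to padded-image pixels, remove the padding offset, clamp to the image.
        pixel = int(value / grid_max * max_dim - pad)
        return max(0, min(pixel, limit))

    x1, y1, x2, y2 = box
    return [unpad(x1, pad_x, orig_w), unpad(y1, pad_y, orig_h),
            unpad(x2, pad_x, orig_w), unpad(y2, pad_y, orig_h)]


# 640x480 image: max_dim = 640, pad_x = 0, pad_y = 80.
# y1: 162 / 999 * 640 = 103.78 -> minus 80 -> 23
# y2: 825 / 999 * 640 = 528.53 -> minus 80 -> 448
print(grid_to_pixels((327, 162, 625, 825), 640, 480))  # [209, 23, 400, 448]
```

Coordinates that fall inside the padding bands are clamped to the image border, so a box of `[[0,0,999,999]]` maps to the full original frame.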

📚 Acknowledgements & Citation

This project is an independent architectural replication inspired by the groundbreaking research from DeepSeek-AI:

@article{lu2026thinking,
  title={Thinking with Visual Primitives},
  author={Lu, Ruijie and Ma, Yiyang and Chen, Xiaokang and Luo, Lingxiao and Wu, Zhiyu and Pan, Zizheng and Liu, Xingchao and Lin, Yutong and Li, Hao and Liu, Wen and others},
  journal={arXiv preprint},
  year={2026}
}

Special thanks to:

  • DeepSeek-AI for open-sourcing the "Thinking with Visual Primitives" methodology and highlighting the Reference Gap.
  • Qwen Team for the excellent Qwen2.5-3B base model.
  • Google for the SigLIP vision encoder.