# PitVQA Unified Model v2

A multi-task vision-language model for pituitary surgery understanding, trained with 4-stage curriculum learning.

## Model Description

This model fine-tunes Qwen2-VL-2B-Instruct using LoRA adapters for surgical scene understanding. It handles multiple tasks through specialized adapter stages:
| Stage | Task | Adapter | Description |
|---|---|---|---|
| 1 | Point Localization | stage1 | `<point x='45.2' y='68.3'>suction device</point>` |
| 2 | Bounding Box | stage2 | `<box x1='20' y1='30' x2='60' y2='70'>tumor</box>` |
| 3 | Motion Detection | stage3 | Temporal motion analysis between frames |
| 4 | Unified | stage4 | All tasks combined (recommended) |
## Training Details

- Base Model: Qwen/Qwen2-VL-2B-Instruct (2B parameters)
- Method: LoRA (r=16, alpha=32); only 18M trainable parameters
- Dataset: mmrech/pitvqa-comprehensive-spatial
- Training: Supervised fine-tuning (SFT) with TRL
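As a sketch, the LoRA setup above maps to a peft `LoraConfig` like the following. The `r` and `lora_alpha` values are those stated in this card; `target_modules` and `lora_dropout` are assumptions (typical choices for Qwen2-VL attention layers), not details confirmed here.

```python
from peft import LoraConfig

# r=16 and alpha=32 are from the training details above (~18M trainable params).
# target_modules and lora_dropout are assumed, not stated in this card.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

This config would be passed to TRL's `SFTTrainer` (or wrapped via `get_peft_model`) during fine-tuning.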
## Usage

### Load with Adapters (Full Flexibility)
```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load base model (4-bit quantized for efficiency)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# Load unified adapter (stage4 - recommended for most tasks)
model = PeftModel.from_pretrained(
    base,
    "mmrech/pitvqa-qwen2vl-unified-v2",
    adapter_name="stage4",
    subfolder="stage4",
)

# Or load multiple adapters and switch between them
model.load_adapter("mmrech/pitvqa-qwen2vl-unified-v2", adapter_name="stage1", subfolder="stage1")
model.load_adapter("mmrech/pitvqa-qwen2vl-unified-v2", adapter_name="stage2", subfolder="stage2")
model.set_adapter("stage4")  # Switch to unified adapter
```
### Inference Example
```python
from PIL import Image

# Load surgical image
image = Image.open("surgical_frame.jpg")

# Point localization
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": "Point to the suction device in this surgical image."},
]}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
response = processor.decode(output[0], skip_special_tokens=True)
# Output: <point x='75.8' y='75.1'>suction device</point>
```
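Downstream code usually needs the coordinates rather than the raw tag. A minimal sketch of extracting them from the model's response; the helper name and regex are illustrative, assuming the `<point>` format shown above:

```python
import re

def parse_point(response):
    """Extract (x, y, label) from a <point x='..' y='..'>label</point> tag.

    Returns None when no point tag is found.
    """
    m = re.search(r"<point x='([\d.]+)' y='([\d.]+)'>([^<]*)</point>", response)
    if m is None:
        return None
    return float(m.group(1)), float(m.group(2)), m.group(3)

print(parse_point("<point x='75.8' y='75.1'>suction device</point>"))
# -> (75.8, 75.1, 'suction device')
```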
### For Easier Deployment

Use the merged model, which has the Stage 4 adapter merged into the base weights, so no adapter loading is required.
## Supported Tasks

### 1. Point Localization (Stage 1 or 4)

Prompt: `"Point to the {target} in this surgical image."`
Output: `<point x='45.2' y='68.3'>suction device</point>`
### 2. Bounding Box Detection (Stage 2 or 4)

Prompt: `"Draw a bounding box around the {target}."`
Output: `<box x1='20' y1='30' x2='60' y2='70'>tumor region</box>`
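Bounding-box responses can be parsed the same way as points. A small illustrative helper, assuming the `<box>` format shown above:

```python
import re

def parse_box(response):
    """Extract (x1, y1, x2, y2, label) from a <box ...>label</box> tag.

    Returns None when no box tag is present.
    """
    m = re.search(
        r"<box x1='([\d.]+)' y1='([\d.]+)' x2='([\d.]+)' y2='([\d.]+)'>([^<]*)</box>",
        response,
    )
    if m is None:
        return None
    x1, y1, x2, y2 = (float(g) for g in m.groups()[:4])
    return x1, y1, x2, y2, m.group(5)

print(parse_box("<box x1='20' y1='30' x2='60' y2='70'>tumor region</box>"))
# -> (20.0, 30.0, 60.0, 70.0, 'tumor region')
```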
### 3. Phase Classification (Stage 4)

Prompt: `"What surgical phase is shown?"`
Output: `sellar phase`
Phases: `nasal`, `sellar`, `tumor_removal`, `closure`
### 4. Free-form Queries (Stage 4)

Prompt: `"Describe the surgical instruments visible."`
Output: The image shows a suction device in the lower right quadrant...
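The four prompt patterns above can be collected into a small dispatch helper. `PROMPTS` and `build_prompt` are illustrative names, assuming the exact prompt strings listed above:

```python
# Prompt templates for the four supported tasks (taken verbatim from above).
PROMPTS = {
    "point": "Point to the {target} in this surgical image.",
    "box": "Draw a bounding box around the {target}.",
    "phase": "What surgical phase is shown?",
    "describe": "Describe the surgical instruments visible.",
}

def build_prompt(task, target=None):
    """Fill the template for a task; `target` is required for point/box."""
    template = PROMPTS[task]
    return template.format(target=target) if target is not None else template

print(build_prompt("point", "suction device"))
# -> Point to the suction device in this surgical image.
```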
## Model Files

```
mmrech/pitvqa-qwen2vl-unified-v2/
├── stage1/                       # Point localization adapter
├── stage2/                       # Bounding box adapter
├── stage3/                       # Motion detection adapter
├── stage4/                       # Unified adapter (all tasks)
└── showcase_examples_full.json   # Verified examples with images
```
## Demo

Try the interactive demo: PitVQA Space
## Citation

```bibtex
@misc{pitvqa2026,
  title={PitVQA: Multi-Task Vision-Language Model for Pituitary Surgery},
  author={Matheus Rech},
  year={2026},
  url={https://huggingface.co/mmrech/pitvqa-qwen2vl-unified-v2}
}
```
## License

Apache 2.0