PitVQA Unified Model v2

A multi-task vision-language model for pituitary surgery understanding, trained with 4-stage curriculum learning.

Model Description

This model fine-tunes Qwen2-VL-2B-Instruct using LoRA adapters for surgical scene understanding. It handles multiple tasks through specialized adapter stages:

| Stage | Task | Adapter | Description |
|-------|------|---------|-------------|
| 1 | Point Localization | stage1 | `<point x='45.2' y='68.3'>suction device</point>` |
| 2 | Bounding Box | stage2 | `<box x1='20' y1='30' x2='60' y2='70'>tumor</box>` |
| 3 | Motion Detection | stage3 | Temporal motion analysis between frames |
| 4 | Unified | stage4 | All tasks combined (recommended) |

Training Details

  • Base Model: Qwen/Qwen2-VL-2B-Instruct (2B parameters)
  • Method: LoRA (r=16, alpha=32), with only 18M trainable parameters
  • Dataset: mmrech/pitvqa-comprehensive-spatial
  • Training: SFT (Supervised Fine-Tuning) with TRL

Usage

Load with Adapters (Full Flexibility)

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load base model (4-bit quantized for efficiency)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# Load unified adapter (stage4 - recommended for most tasks)
model = PeftModel.from_pretrained(base, "mmrech/pitvqa-qwen2vl-unified-v2",
                                   adapter_name="stage4", subfolder="stage4")

# Or load multiple adapters and switch between them
model.load_adapter("mmrech/pitvqa-qwen2vl-unified-v2", adapter_name="stage1", subfolder="stage1")
model.load_adapter("mmrech/pitvqa-qwen2vl-unified-v2", adapter_name="stage2", subfolder="stage2")
model.set_adapter("stage4")  # Switch to unified adapter

Inference Example

from PIL import Image

# Load surgical image
image = Image.open("surgical_frame.jpg")

# Point localization
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": "Point to the suction device in this surgical image."}
]}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens (skip the echoed prompt)
response = processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
# Output: <point x='75.8' y='75.1'>suction device</point>
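The `<point>` tag in the response can be extracted with a small regex. A sketch, assuming the coordinates are percentages of image width/height (the exact coordinate convention is not stated on the card):

```python
import re

def parse_point(text):
    """Extract (x, y, label) from a <point x='..' y='..'>label</point> tag."""
    m = re.search(r"<point x='([\d.]+)' y='([\d.]+)'>(.*?)</point>", text)
    if m is None:
        return None
    return float(m.group(1)), float(m.group(2)), m.group(3)

x, y, label = parse_point("<point x='75.8' y='75.1'>suction device</point>")

# Assuming percentage coordinates, convert to pixels for a 1280x720 frame:
px, py = x / 100 * 1280, y / 100 * 720
print(label, round(px), round(py))  # -> suction device 970 541
```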

For Easier Deployment

Use the merged model, which has the Stage 4 adapter baked into the weights, so no adapter loading is required.

Supported Tasks

1. Point Localization (Stage 1 or 4)

Prompt: "Point to the {target} in this surgical image."
Output: <point x='45.2' y='68.3'>suction device</point>

2. Bounding Box Detection (Stage 2 or 4)

Prompt: "Draw a bounding box around the {target}."
Output: <box x1='20' y1='30' x2='60' y2='70'>tumor region</box>
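The `<box>` output can be parsed and visualized the same way. A sketch using Pillow, again assuming percentage coordinates (an assumption; the card does not state the units):

```python
import re
from PIL import Image, ImageDraw

def parse_box(text):
    """Extract (x1, y1, x2, y2, label) from a <box ...>label</box> tag."""
    m = re.search(
        r"<box x1='([\d.]+)' y1='([\d.]+)' x2='([\d.]+)' y2='([\d.]+)'>(.*?)</box>",
        text,
    )
    if m is None:
        return None
    x1, y1, x2, y2 = (float(m.group(i)) for i in range(1, 5))
    return x1, y1, x2, y2, m.group(5)

x1, y1, x2, y2, label = parse_box("<box x1='20' y1='30' x2='60' y2='70'>tumor region</box>")

# Draw the box on a dummy 1280x720 frame, scaling percentages to pixels.
img = Image.new("RGB", (1280, 720), "black")
w, h = img.size
draw = ImageDraw.Draw(img)
draw.rectangle(
    [x1 / 100 * w, y1 / 100 * h, x2 / 100 * w, y2 / 100 * h],
    outline="red", width=3,
)
img.save("boxed_frame.png")
```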

3. Phase Classification (Stage 4)

Prompt: "What surgical phase is shown?"
Output: sellar phase

Phases: nasal, sellar, tumor_removal, closure
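Since the model generates free text, it can help to snap the answer onto the four phase labels above before using it downstream. A minimal sketch (the matching heuristic is an assumption, not part of the model):

```python
PHASES = ["nasal", "sellar", "tumor_removal", "closure"]

def normalize_phase(response):
    """Map a free-text answer onto one of the four known phase labels."""
    text = response.lower().replace(" ", "_")
    for phase in PHASES:
        if phase in text:
            return phase
    return None  # no known phase mentioned

print(normalize_phase("sellar phase"))  # -> sellar
```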

4. Free-form Queries (Stage 4)

Prompt: "Describe the surgical instruments visible."
Output: The image shows a suction device in the lower right quadrant...

Model Files

mmrech/pitvqa-qwen2vl-unified-v2/
├── stage1/          # Point localization adapter
├── stage2/          # Bounding box adapter
├── stage3/          # Motion detection adapter
├── stage4/          # Unified adapter (all tasks)
└── showcase_examples_full.json  # Verified examples with images

Demo

Try the interactive demo: PitVQA Space

Citation

@misc{pitvqa2026,
  title={PitVQA: Multi-Task Vision-Language Model for Pituitary Surgery},
  author={Matheus Rech},
  year={2026},
  url={https://huggingface.co/mmrech/pitvqa-qwen2vl-unified-v2}
}

License

Apache 2.0
