Daily Model Scout Report – 2026-04-07
#6
by msudharsanan
Scope: New / recently-active VLMs on HuggingFace (last ~7 days), evaluated for our 9-field garment JSON extraction task. Hardware: RTX PRO 6000 (98 GB).
Current bests on the 3,500-sample hard eval (_overall.weighted_score):
- qwen3-vl-8b-sft+grpo – 0.9131 (best overall)
- qwen3-vl-8b-sft-grpo-nvfp4 – 0.8945 (best quantized)
- qwen3-vl-2b-sft-grpo-v9 – 0.8948 (best small)
- qwen35-2b-base – 0.8437 (best Qwen3.5 base)
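For context on what `_overall.weighted_score` aggregates, here is a minimal sketch of a weighted per-field scorer for the 9-field garment JSON task. The field names and weights below are illustrative assumptions, not the actual eval config:

```python
# Hypothetical field weights for the 9-field garment schema; the real
# eval's fields and weights may differ.
FIELD_WEIGHTS = {
    "category": 2.0, "color": 1.5, "pattern": 1.5, "material": 1.0,
    "sleeve": 1.0, "neckline": 1.0, "fit": 1.0, "length": 0.5,
    "closure": 0.5,
}

def weighted_score(field_scores: dict[str, float]) -> float:
    """Weighted mean of per-field scores in [0, 1]; missing fields score 0."""
    total_w = sum(FIELD_WEIGHTS.values())
    return sum(FIELD_WEIGHTS[f] * field_scores.get(f, 0.0)
               for f in FIELD_WEIGHTS) / total_w
```

The per-sample scores are then averaged over the 3,500-sample hard eval to produce the headline number.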
HIGH relevance – benchmark this week
1. Qwen3.5-VL family
- HF: https://huggingface.co/collections/Qwen/qwen3-vl
- Sizes: 0.8B, 2B, 4B, 9B dense; 35B-A3B and 122B-A10B MoE; 397B-A17B flagship.
- Architecture: Gated Delta Networks + sparse MoE, early-fusion vision tokens, 262K native context, 201 languages.
- Why: Most likely candidate to beat 0.9131 with our existing SFT+GRPO recipe. 9B dense fits trivially; 35B-A3B MoE activates only 3B/token and also fits.
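A quick back-of-envelope check of the "fits trivially" claim against the 98 GB card. Note that the 35B-A3B MoE activates only ~3B params per token for compute, but all 35B of weights still need to be resident. The helper below is a rough sketch, not a measured figure:

```python
def vram_gb(params_b: float, bytes_per_param: float = 2.0,
            overhead: float = 1.2) -> float:
    """Rough weight-memory estimate in GB: bf16 weights (2 bytes/param)
    plus ~20% headroom for activations/KV cache. Back-of-envelope only."""
    return params_b * bytes_per_param * overhead

# 9B dense at bf16: ~21.6 GB -> trivial fit in 98 GB.
# 35B-A3B MoE: ~84 GB of resident weights -> tight but fits;
# only ~3B params are active per token, so compute stays cheap.
```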
2. InternVL3.5-8B
- HF: https://huggingface.co/OpenGVLab/InternVL3_5-8B
- InternViT-300M + Qwen3-8B backbone. 38B / 78B siblings exist.
- Why: Direct size match to qwen3-vl-8b-sft+grpo. Best non-Qwen 8B option for fine-grained attribute extraction.
MEDIUM
3. IBM Granite 4.0 3B Vision
- HF: https://huggingface.co/ibm-granite/granite-4.0-3b-vision
- 3.5B base + 0.5B LoRA adapter. Apache-2.0.
- DeepStack injection; trained for table / chart / KVP extraction – closer to our 9-field JSON than generic VQA. 85.5% in-domain zero-shot exact-match; #3 in the 2–4B class on VAREX.
- Risk: document-centric pretraining may not transfer to natural garment photography. Cheap to probe.
4. GLM-4.6V-Flash (9B)
- HF: https://huggingface.co/zai-org/GLM-4.6V-Flash (106B sibling: https://huggingface.co/zai-org/GLM-4.6V)
- Multimodal reasoning, native tool use, 128K context, scalable RL pretraining. Native `transformers` support.
- Why: Different lineage than Qwen – ensemble diversity. GLM-4.5V already showed strong structured-extraction results.
5. MiniCPM-V 4.5 (8B)
- HF: https://huggingface.co/openbmb/MiniCPM-V-4_5
- Qwen3-8B + SigLIP2-400M encoder.
- Why: SigLIP2 is one of the strongest open encoders for fine-grained color/pattern. Same LLM as InternVL3.5-8B – clean encoder ablation.
LOW
- Molmo2 – listed in the 2026 OS-VLM survey, no fresh HF checkpoints in the last 7 days. Watchlist.
- Fashion / clothing fine-tunes – none new this week. FashionCLIP 2.0 / `EMaghakyan/fashion-clip` are CLIP-class embedders only – possible re-ranker / auxiliary loss.
Recommended actions
- Zero-shot benchmark on 3.5k-hard (HIGH): `Qwen/Qwen3.5-VL-2B`, `Qwen/Qwen3.5-VL-9B`, `OpenGVLab/InternVL3_5-8B`.
- Zero-shot benchmark (MEDIUM): `zai-org/GLM-4.6V-Flash`, `openbmb/MiniCPM-V-4_5`, `ibm-granite/granite-4.0-3b-vision`.
- Any model with zero-shot ≥ ~0.78 weighted → standard pipeline: SFT → GRPO → eval-3.5k → update JSON/wiki → upload to HF with full model card + charts.
- Watchlist: Molmo2, Kimi-VL successors, additional Qwen3.5-VL-MoE checkpoints.
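The zero-shot gate above can be sketched as a one-liner; `promote` is a hypothetical helper (the 0.78 cutoff is the threshold stated in the actions, the model IDs are from this report):

```python
def promote(zero_shot_scores: dict[str, float],
            threshold: float = 0.78) -> list[str]:
    """Return HF model IDs whose zero-shot weighted score clears the gate;
    these proceed to the standard SFT -> GRPO -> eval-3.5k pipeline."""
    return sorted(m for m, s in zero_shot_scores.items() if s >= threshold)

# Illustrative scores only, not real results:
promote({"Qwen/Qwen3.5-VL-9B": 0.81,
         "ibm-granite/granite-4.0-3b-vision": 0.70})
# -> ["Qwen/Qwen3.5-VL-9B"]
```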