Daily Model Scout Report – 2026-04-07
#6
by msudharsanan
Scope: New / recently-active VLMs on HuggingFace (last ~7 days), evaluated for our 9-field garment JSON extraction task. Hardware: RTX PRO 6000 (98 GB).
Current bests on the 3,500-sample hard eval (_overall.weighted_score):
- qwen3-vl-8b-sft+grpo – 0.9131 (best overall)
- qwen3-vl-8b-sft-grpo-nvfp4 – 0.8945 (best quantized)
- qwen3-vl-2b-sft-grpo-v9 – 0.8948 (best small)
- qwen35-2b-base – 0.8437 (best Qwen3.5 base)
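For context on what `_overall.weighted_score` aggregates, here is a minimal sketch of a weighted per-field scorer for the 9-field garment JSON task. The field names and weights below are illustrative assumptions, not the actual eval config:

```python
# Hypothetical field weights for the 9-field garment schema; the real
# eval's fields and weights may differ.
FIELD_WEIGHTS = {
    "category": 2.0, "color": 1.5, "pattern": 1.5, "material": 1.0,
    "sleeve": 1.0, "neckline": 1.0, "fit": 1.0, "length": 0.5,
    "closure": 0.5,
}

def weighted_score(field_scores: dict[str, float]) -> float:
    """Weighted mean of per-field scores in [0, 1]; missing fields score 0."""
    total_w = sum(FIELD_WEIGHTS.values())
    return sum(FIELD_WEIGHTS[f] * field_scores.get(f, 0.0)
               for f in FIELD_WEIGHTS) / total_w
```

The per-sample scores are then averaged over the 3,500-sample hard eval to produce the headline number.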
HIGH relevance – benchmark this week
1. Qwen3.5-VL family
- HF: https://huggingface.co/collections/Qwen/qwen3-vl
- Sizes: 0.8B, 2B, 4B, 9B dense; 35B-A3B and 122B-A10B MoE; 397B-A17B flagship.
- Architecture: Gated Delta Networks + sparse MoE, early-fusion vision tokens, 262K native context, 201 languages.
- Why: Most likely candidate to beat 0.9131 with our existing SFT+GRPO recipe. 9B dense fits trivially; 35B-A3B MoE activates only 3B/token and also fits.
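A quick back-of-envelope check of the "fits trivially" claim against the 98 GB card. Note that the 35B-A3B MoE activates only ~3B params per token for compute, but all 35B of weights still need to be resident. The helper below is a rough sketch, not a measured figure:

```python
def vram_gb(params_b: float, bytes_per_param: float = 2.0,
            overhead: float = 1.2) -> float:
    """Rough weight-memory estimate in GB: bf16 weights (2 bytes/param)
    plus ~20% headroom for activations/KV cache. Back-of-envelope only."""
    return params_b * bytes_per_param * overhead

# 9B dense at bf16: ~21.6 GB -> trivial fit in 98 GB.
# 35B-A3B MoE: ~84 GB of resident weights -> tight but fits;
# only ~3B params are active per token, so compute stays cheap.
```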
2. InternVL3.5-8B
- HF: https://huggingface.co/OpenGVLab/InternVL3_5-8B
- InternViT-300M + Qwen3-8B backbone. 38B / 78B siblings exist.
- Why: Direct size match to qwen3-vl-8b-sft+grpo. Best non-Qwen 8B option for fine-grained attribute extraction.
MEDIUM
3. IBM Granite 4.0 3B Vision
- HF: https://huggingface.co/ibm-granite/granite-4.0-3b-vision
- 3.5B base + 0.5B LoRA adapter. Apache-2.0.
- DeepStack injection; trained for table / chart / KVP extraction – closer to our 9-field JSON than generic VQA. 85.5% in-domain zero-shot exact-match; #3 in the 2–4B class on VAREX.
- Risk: document-centric pretraining may not transfer to natural garment photography. Cheap to probe.
4. GLM-4.6V-Flash (9B)
- HF: https://huggingface.co/zai-org/GLM-4.6V-Flash (106B sibling: https://huggingface.co/zai-org/GLM-4.6V)
- Multimodal reasoning, native tool use, 128K context, scalable RL pretraining. Native `transformers` support.
- Why: Different lineage than Qwen – ensemble diversity. GLM-4.5V already showed strong structured-extraction results.
5. MiniCPM-V 4.5 (8B)
- HF: https://huggingface.co/openbmb/MiniCPM-V-4_5
- Qwen3-8B + SigLIP2-400M encoder.
- Why: SigLIP2 is one of the strongest open encoders for fine-grained color/pattern. Same LLM as InternVL3.5-8B – clean encoder ablation.
LOW
- Molmo2 – listed in the 2026 OS-VLM survey, no fresh HF checkpoints in the last 7 days. Watchlist.
- Fashion / clothing fine-tunes – none new this week. FashionCLIP 2.0 / `EMaghakyan/fashion-clip` are CLIP-class embedders only – possible re-ranker / auxiliary loss.
Recommended actions
- Zero-shot benchmark on 3.5k-hard (HIGH): `Qwen/Qwen3.5-VL-2B`, `Qwen/Qwen3.5-VL-9B`, `OpenGVLab/InternVL3_5-8B`.
- Zero-shot benchmark (MEDIUM): `zai-org/GLM-4.6V-Flash`, `openbmb/MiniCPM-V-4_5`, `ibm-granite/granite-4.0-3b-vision`.
- Any model with zero-shot ≥ ~0.78 weighted → standard pipeline: SFT → GRPO → eval-3.5k → update JSON/wiki → upload to HF with full model card + charts.
- Watchlist: Molmo2, Kimi-VL successors, additional Qwen3.5-VL-MoE checkpoints.
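The zero-shot gate above can be sketched as a one-liner; `promote` is a hypothetical helper (the 0.78 cutoff is the threshold stated in the actions, the model IDs are from this report):

```python
def promote(zero_shot_scores: dict[str, float],
            threshold: float = 0.78) -> list[str]:
    """Return HF model IDs whose zero-shot weighted score clears the gate;
    these proceed to the standard SFT -> GRPO -> eval-3.5k pipeline."""
    return sorted(m for m, s in zero_shot_scores.items() if s >= threshold)

# Illustrative scores only, not real results:
promote({"Qwen/Qwen3.5-VL-9B": 0.81,
         "ibm-granite/granite-4.0-3b-vision": 0.70})
# -> ["Qwen/Qwen3.5-VL-9B"]
```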