Daily Model Scout Report – 2026-05-06
#23
by msudharsanan - opened
Scope: New VLMs released in the last ~7 days that could move the needle on our 3,500-sample hard eval (current best: qwen3-vl-8b-sft+grpo at weighted_score 0.9131).
Method: Searched HF API for image-text-to-text and visual-question-answering models created after 2026-04-29, plus targeted searches for Qwen3.6, Gemma-4, Cosmos-Reason2, LFM2.5-VL, InternVL, MiniCPM-V, Phi, Florence, PaliGemma, LLaVA, Idefics, Molmo, Moondream. Most "last 7 days" hits are community quantizations / heretic merges – the genuinely interesting candidates are the upstream Apache-2.0 base models from April that we have not yet benchmarked.
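For reproducibility, the scan can be scripted against the Hub API. A minimal sketch, assuming a recent `huggingface_hub` where `list_models` accepts `pipeline_tag` and its `ModelInfo` objects carry `created_at`; the cutoff date and pipeline tags are the ones used in this report, everything else is scaffolding:

```python
from datetime import datetime, timezone

def in_window(created_at, cutoff):
    """True if a model's creation timestamp falls strictly after the cutoff."""
    return created_at is not None and created_at > cutoff

def scan_new_vlms(cutoff, tags=("image-text-to-text", "visual-question-answering")):
    """Yield (repo_id, created_at) for Hub models created after `cutoff`."""
    from huggingface_hub import HfApi  # lazy import: this is a network-facing call
    api = HfApi()
    for tag in tags:
        # Newest-first listing, then a client-side date filter.
        for m in api.list_models(pipeline_tag=tag, sort="createdAt",
                                 direction=-1, limit=500):
            if in_window(getattr(m, "created_at", None), cutoff):
                yield m.id, m.created_at

# Usage (not run here):
# list(scan_new_vlms(datetime(2026, 4, 29, tzinfo=timezone.utc)))
```

The date filter is kept as a separate pure function so the windowing logic can be tested without touching the network.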
🔥 Within strict 7-day window (created after 2026-04-29)
1. nvidia/Cosmos-Reason2-32B – Relevance: MEDIUM
- Released: 2026-04-29
- Link: https://huggingface.co/nvidia/Cosmos-Reason2-32B
- Architecture: Qwen3-VL-32B-Instruct fine-tune (so same family as our current best)
- Params: 32B dense, BF16, Vision Transformer + Dense Transformer, 256K context
- License: NVIDIA Open Model License + Apache 2.0 (commercial OK)
- Why it might help us: Built on the same Qwen3-VL backbone we already SFT+GRPO. NVIDIA reports +2.78 pts on general reasoning vs base Qwen3-VL-32B (75.85% vs 73.07%) and very large gains on Smart Spaces (77.79 vs 47.55) and Self-Driving (70.15 vs 48.08). Tuned for physical-AI / robotics, not garments – but the upgrade path from our current 8B → 32B with the same architecture is straightforward, and the post-training apparently lifted general-purpose vision reasoning too.
- Why MEDIUM not HIGH: It's specialized for video + robotics CoT, so we'd be paying for a lot of capability we don't use. The better choice for "I want a 32B Qwen3-VL backbone with extra reasoning" is to SFT directly on Qwen/Qwen3-VL-32B-Instruct ourselves. Worth a 100-sample probe before deciding.
2. Qwen3.6-VL community variants (REAP-pruned) – Relevance: LOW (for now)
- Examples: keithnull/Qwen3.6-VL-REAP-26B-A3B-GGUF (2026-04-27), mattbucci/Qwen3.6-VL-REAP-26B-A3B-AWQ (2026-04-28), upstream at bender/Qwen3.6-VL-REAP-26B-A3B
- What it is: Cerebras REAP (Router-Expert Activation Pruning) applied to Qwen/Qwen3.6-35B-A3B, dropping 256 → 192 routed experts (~25% pruning) → 27B with the vision encoder retained.
- Why LOW: Community pruning + community quantizations. We should benchmark the official Qwen3.6 base first (see below) before chasing pruned forks.
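A quick back-of-envelope consistency check on those pruning numbers (a sketch only; the actual per-expert parameter count is not published here): dropping 64 of 256 routed experts is exactly 25%, and going 35B → ~27B implies roughly 0.125B parameters per routed expert.

```python
def pruned_total_b(total_b, per_expert_b, n_dropped):
    """Total params (billions) after removing `n_dropped` routed experts."""
    return total_b - n_dropped * per_expert_b

dropped = 256 - 192                 # 64 routed experts removed
frac = dropped / 256                # 0.25 -> the "~25% pruning" in the listing
per_expert_b = (35 - 27) / dropped  # ~0.125B params implied per routed expert
```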
📈 Recent base-model releases worth benchmarking (April 2026 – fresh enough to matter)
3. Qwen/Qwen3.6-35B-A3B – Relevance: HIGH
- Released: 2026-04-15 · Downloads: 3M · Likes: 1,642
- Link: https://huggingface.co/Qwen/Qwen3.6-35B-A3B
- Architecture: MoE, 35B total / 3B activated, 256 experts (8 routed + 1 shared), Gated DeltaNet + Gated Attention hybrid, native vision encoder, 262K context (1M w/ YaRN). Apache 2.0.
- Pipeline tag: image-text-to-text (multimodal native).
- Why HIGH: This is the natural successor to the Qwen3-VL-8B that's already our SOTA. With only 3B active params per token it should infer at roughly our 2B model's speed while having the representational headroom of a 35B. BF16 weights ≈ 70 GB – fits comfortably on an RTX PRO 6000 98GB. Apache 2.0 means clean licensing for ReLo deployment. Recommend an SFT+GRPO run on the standard 7,672-row apparel-capture-8k-train pipeline and eval on the 3.5k hard set.
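The VRAM figures used throughout this report follow from 2 bytes per parameter for BF16 weights (decimal GB, weights only; KV cache, activations, and vision tokens come on top):

```python
def bf16_weights_gb(params_billion):
    """Weight-only footprint in decimal GB: 2 bytes per parameter in BF16."""
    return 2.0 * params_billion

# Matches the sizing quoted against the 98 GB RTX PRO 6000:
# 35B -> 70 GB (Qwen3.6-35B-A3B), 27B -> 54 GB (Qwen3.6-27B).
```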
4. Qwen/Qwen3.6-27B – Relevance: HIGH
- Released: 2026-04-21 · Downloads: 1.6M · Likes: 1,148
- Link: https://huggingface.co/Qwen/Qwen3.6-27B
- Architecture: Dense 27B, hidden 5120, 64 layers, hybrid Gated DeltaNet + Gated Attention, native vision encoder, 262K context, BF16. Apache 2.0. vLLM v0.19.0+ and SGLang v0.5.10+ supported.
- Why HIGH: Dense-27B variant of the Qwen3.6-VL line – the closest apples-to-apples scale-up from our current 8B SOTA. Stronger reported general benchmarks than Qwen3.5-27B. BF16 ≈ 54 GB – fits on the RTX PRO 6000 with room to spare for activations and vision tokens. If MoE routing complicates LoRA/GRPO for #3, this is the safe-choice upgrade.
5. Qwen/Qwen3.6-27B-FP8 – Relevance: MEDIUM
- Released: 2026-04-21 · Downloads: 2.3M
- Link: https://huggingface.co/Qwen/Qwen3.6-27B-FP8
- Why MEDIUM: Official FP8 of #4. Useful as the inference-side comparison point if we land Qwen3.6-27B SFT and want to benchmark FP8 vs NVFP4 quantized variants (we already have the NVFP4 pipeline).
6. Qwen/Qwen3.6-35B-A3B-FP8 – Relevance: MEDIUM
- Released: 2026-04-15 · Downloads: 2.8M
- Link: https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8
- Why MEDIUM: Official FP8 of #3. Same logic as above for MoE variant.
7. LiquidAI/LFM2.5-VL-450M – Relevance: LOW (but worth a probe)
- Released: 2026-04-08 · Downloads: 49k · Likes: 168
- Link: https://huggingface.co/LiquidAI/LFM2.5-VL-450M
- Architecture: 450M total (Liquid backbone + SigLIP2 vision), bbox prediction, function calling, multilingual. License: "other".
- Why LOW: Way below our quality threshold for a 9-field hard eval – but with bbox prediction baked in, it's interesting for the SAM3.1 defect-segmentation pipeline (project_sellability_sam3_training) as a fast pre-filter, not a replacement for our garment classifier.
🎯 Recent releases NOT in 7-day window but still un-benchmarked (March 2026)
8. google/gemma-4-31B-it – Relevance: MEDIUM
- Released: 2026-03-11 · Downloads: 8.4M · Likes: 2,535
- Link: https://huggingface.co/google/gemma-4-31B-it
- Architecture: 30.7B dense, 60 layers, ~550M vision encoder, Apache 2.0, 256K context, multilingual (140+ langs). Configurable visual token budget (70/140/280/560/1120).
- Benchmarks: MMMU Pro (Vision) 76.9%, MMLU Pro 85.2%, GPQA Diamond 84.3%.
- Why MEDIUM: Different architecture family (no Liger/Unsloth path proven for us – see Granite Vision lessons), but Apache 2.0 dense 31B with strong vision benchmarks is a plausible second-bet if Qwen3.6-27B SFT plateaus. Configurable image-token budget could help on the 3,500-hard set where image complexity varies. Recommend keeping on the watchlist, not immediate priority.
9. google/gemma-4-26B-A4B-it – Relevance: LOW
- Released: 2026-03-11 · Downloads: 6.7M · Likes: 890
- Link: https://huggingface.co/google/gemma-4-26B-A4B-it
- Architecture: MoE, 4B active per token, multimodal, Apache 2.0.
- Why LOW: Same family caveat as #8, plus MoE complexity. Skip until #3 (Qwen3.6-35B-A3B) tells us how MoE routing interacts with our SFT+GRPO+GTPO pipeline.
✅ Recommended actions
| Priority | Action | Owner |
|---|---|---|
| 1 | Register Qwen/Qwen3.6-27B in PeakBench, run the 3.5k-hard eval as base. | Myan |
| 2 | If #1 ≥ 0.85 weighted, kick off the standard SFT → eval → GRPO → eval → upload-to-HF pipeline. | Myan |
| 3 | Register Qwen/Qwen3.6-35B-A3B in PeakBench, run the 3.5k-hard eval as base. Decide MoE-vs-dense based on relative scores. | Myan |
| 4 | Quick 100-sample probe of nvidia/Cosmos-Reason2-32B to see if the post-training transfers to garment attributes at all. Skip if base ≪ Qwen3-VL-32B-Instruct on our task. | Myan |
| 5 | Watchlist only: gemma-4-31B-it, LFM2.5-VL-450M. Do not start training until the Qwen3.6 path is exhausted. | – |
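The gate in rows 1–2 can be written down explicitly. A hypothetical helper (the PeakBench registration and eval steps themselves are internal tooling and are not shown):

```python
def sft_gate(base_weighted_score, threshold=0.85, current_best=0.9131):
    """Decide whether a freshly registered base model earns the full
    SFT -> eval -> GRPO -> eval -> upload pipeline, per the action table."""
    if base_weighted_score >= threshold:
        return "kick off SFT+GRPO pipeline"
    return "deprioritize: base too far below current best ({:.4f})".format(current_best)
```

Making the threshold explicit keeps the go/no-go call auditable when several base models land in the same week.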
🚫 Notable absences (searched, none new in window)
- InternVL – latest is InternVL3.5 from Aug 2025; no InternVL4 yet.
- MiniCPM-V – latest is MiniCPM-o-4.5 / MiniCPM-V-4.5 from Feb 2026.
- Phi-4-Multimodal – no new release in window.
- Florence-3 / Florence-4 – no release.
- PaliGemma3 – no release.
- SmolVLM3 / Idefics4 – no release.
- LLaVA-NeXT-2 / OneVision-2 – no release.
- Molmo / Moondream – no new base in window.
- CogVLM3 / GLM-4.5V / DeepSeek-VL3 – no release in window.
Scout report generated by Claude Code on behalf of Myan Sudharsanan (Innovation AI Developer II, Denali AI).