Daily Model Scout Report — 2026-05-06


Scope: New VLMs released in the last ~7 days that could move the needle on our 3,500-sample hard eval (current best: qwen3-vl-8b-sft+grpo at weighted_score 0.9131).

Method: Searched HF API for image-text-to-text and visual-question-answering models created after 2026-04-29, plus targeted searches for Qwen3.6, Gemma-4, Cosmos-Reason2, LFM2.5-VL, InternVL, MiniCPM-V, Phi, Florence, PaliGemma, LLaVA, Idefics, Molmo, Moondream. Most "last 7 days" hits are community quantizations / heretic merges — the genuinely interesting candidates are the upstream Apache-2.0 base models from April that we have not yet benchmarked.
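
The search described above boils down to a pipeline-tag filter plus a creation-date cutoff. A minimal sketch of that filter logic, run here against mock records (in practice each record would come from `huggingface_hub`'s `HfApi.list_models`, e.g. `HfApi().list_models(pipeline_tag="image-text-to-text", sort="createdAt", direction=-1, limit=500)`; the `ModelRecord` stand-in and its fields are assumptions, not the actual API payload):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ModelRecord:
    # Stand-in for the fields the filter needs from a hub listing.
    model_id: str
    pipeline_tag: str
    created_at: datetime

SCOUT_TAGS = {"image-text-to-text", "visual-question-answering"}

def scout_window(records, cutoff):
    """Keep models with a scouted pipeline tag created after `cutoff`, newest first."""
    hits = [r for r in records
            if r.pipeline_tag in SCOUT_TAGS and r.created_at > cutoff]
    return sorted(hits, key=lambda r: r.created_at, reverse=True)

# Mock records using IDs and release dates from this report.
cutoff = datetime(2026, 4, 29, tzinfo=timezone.utc)
records = [
    ModelRecord("nvidia/Cosmos-Reason2-32B", "image-text-to-text",
                datetime(2026, 4, 29, 12, tzinfo=timezone.utc)),
    ModelRecord("Qwen/Qwen3.6-27B", "image-text-to-text",
                datetime(2026, 4, 21, tzinfo=timezone.utc)),
]
in_window = scout_window(records, cutoff)
print([r.model_id for r in in_window])  # only the Cosmos release clears the cutoff
```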


🔥 Within strict 7-day window (created after 2026-04-29)

1. nvidia/Cosmos-Reason2-32B — Relevance: MEDIUM

  • Released: 2026-04-29
  • Link: https://huggingface.co/nvidia/Cosmos-Reason2-32B
  • Architecture: Qwen3-VL-32B-Instruct fine-tune (so same family as our current best)
  • Params: 32B dense, BF16, Vision Transformer + Dense Transformer, 256K context
  • License: NVIDIA Open Model License + Apache 2.0 (commercial OK)
  • Why it might help us: Built on the same Qwen3-VL backbone we already SFT+GRPO. NVIDIA reports +2.78 pts on general reasoning vs base Qwen3-VL-32B (75.85% vs 73.07%) and very large gains on Smart Spaces (77.79 vs 47.55) and Self-Driving (70.15 vs 48.08). Tuned for physical-AI / robotics, not garments — but the upgrade path from our current 8B → 32B with the same architecture is straightforward, and the post-training apparently lifted general-purpose vision reasoning too.
  • Why MEDIUM not HIGH: It's specialized for video + robotics CoT. We'd be paying for a lot of capability we don't use. Better choice for "I want a 32B Qwen3-VL backbone with extra reasoning" is to SFT directly on Qwen/Qwen3-VL-32B-Instruct ourselves. Worth a 100-sample probe before deciding.
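
The 100-sample probe suggested above is just a fixed-seed subsample of the hard eval, so different candidate models are always scored on the exact same rows. A minimal sketch (the ID format and helper are illustrative assumptions, not the actual PeakBench harness):

```python
import random

def probe_subset(eval_ids, k=100, seed=0):
    """Deterministically pick k eval IDs; repeated probes of different
    models then score the exact same subset."""
    rng = random.Random(seed)  # fixed seed -> reproducible probe
    return sorted(rng.sample(list(eval_ids), k))

# Assumed stand-in for the 3,500-row hard-eval index.
hard_eval_ids = [f"sample_{i:04d}" for i in range(3500)]
probe = probe_subset(hard_eval_ids)
print(len(probe), probe[:2])
```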

2. Qwen3.6-VL community variants (REAP-pruned) — Relevance: LOW (for now)

  • Examples: keithnull/Qwen3.6-VL-REAP-26B-A3B-GGUF (2026-04-27), mattbucci/Qwen3.6-VL-REAP-26B-A3B-AWQ (2026-04-28), upstream atbender/Qwen3.6-VL-REAP-26B-A3B
  • What it is: Cerebras REAP (Router-Expert Activation Pruning) applied to Qwen/Qwen3.6-35B-A3B, dropping 256→192 routed experts (~25% pruning) → 27B with the vision encoder retained.
  • Why LOW: Community pruning + community quantizations. We should benchmark the official Qwen3.6 base first (see below) before chasing pruned forks.

🌟 Recent base-model releases worth benchmarking (April 2026 — fresh enough to matter)

3. Qwen/Qwen3.6-35B-A3B — Relevance: HIGH

  • Released: 2026-04-15 · Downloads: 3M · Likes: 1,642
  • Link: https://huggingface.co/Qwen/Qwen3.6-35B-A3B
  • Architecture: MoE, 35B total / 3B activated, 256 experts (8 routed + 1 shared), Gated DeltaNet + Gated Attention hybrid, native vision encoder, 262K context (1M w/ YaRN). Apache 2.0.
  • Pipeline tag: image-text-to-text (multimodal native).
  • Why HIGH: This is the natural successor to the Qwen3-VL-8B that's already our SOTA. With only 3B active params per token it should infer at roughly our 2B model's speed while having the representational headroom of a 35B. BF16 weights ≈ 70 GB → fits comfortably on the RTX PRO 6000 98GB. Apache 2.0 means clean licensing for ReLo deployment. Recommend an SFT+GRPO run on the standard 7,672-row apparel-capture-8k-train pipeline and eval on the 3.5k hard set.

4. Qwen/Qwen3.6-27B — Relevance: HIGH

  • Released: 2026-04-21 · Downloads: 1.6M · Likes: 1,148
  • Link: https://huggingface.co/Qwen/Qwen3.6-27B
  • Architecture: Dense 27B, hidden 5120, 64 layers, hybrid Gated DeltaNet + Gated Attention, native vision encoder, 262K context, BF16. Apache 2.0. vLLM v0.19.0+ and SGLang v0.5.10+ supported.
  • Why HIGH: Dense-27B variant of the Qwen3.6-VL line — the closest apples-to-apples scale-up from our current 8B SOTA. Stronger reported general benchmarks than Qwen3.5-27B. BF16 ≈ 54 GB → fits on the RTX PRO 6000 with room to spare for activations and vision tokens. If MoE routing complicates LoRA/GRPO for #3, this is the safe-choice upgrade.
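
The BF16 footprints quoted for items 3 and 4 follow directly from 2 bytes per parameter; a quick sanity check (the 98 GB card size is from this report, and the headroom figure is a rough illustration that ignores CUDA context and framework overhead):

```python
def bf16_weight_gb(params_billion: float) -> float:
    """BF16 stores 2 bytes per parameter; report figures use decimal GB."""
    return params_billion * 1e9 * 2 / 1e9  # simplifies to 2 * params_billion

for name, params_b in [("Qwen3.6-35B-A3B", 35.0), ("Qwen3.6-27B", 27.0)]:
    gb = bf16_weight_gb(params_b)
    headroom = 98.0 - gb  # RTX PRO 6000 capacity quoted in the report
    print(f"{name}: {gb:.0f} GB weights, ~{headroom:.0f} GB left for KV cache / activations")
```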

5. Qwen/Qwen3.6-27B-FP8 — Relevance: MEDIUM

  • Released: 2026-04-21 · Downloads: 2.3M
  • Link: https://huggingface.co/Qwen/Qwen3.6-27B-FP8
  • Why MEDIUM: Official FP8 of #4. Useful as the inference-side comparison point if we land Qwen3.6-27B SFT and want to benchmark FP8 vs NVFP4 quantized variants (we already have the NVFP4 pipeline).

6. Qwen/Qwen3.6-35B-A3B-FP8 — Relevance: MEDIUM

  • Why MEDIUM: Official FP8 of #3. Same inference-side comparison role as #5, but for the MoE variant.

7. LiquidAI/LFM2.5-VL-450M — Relevance: LOW (but worth a probe)

  • Released: 2026-04-08 · Downloads: 49k · Likes: 168
  • Link: https://huggingface.co/LiquidAI/LFM2.5-VL-450M
  • Architecture: 450M total (Liquid backbone + SigLIP2 vision), bbox prediction, function calling, multilingual. License: "other".
  • Why LOW: Way below our quality threshold for a 9-field hard eval — but with bbox prediction baked in, it's interesting for the SAM3.1 defect-segmentation pipeline (project_sellability_sam3_training) as a fast pre-filter, not a replacement for our garment classifier.

🎯 Recent releases NOT in 7-day window but still un-benchmarked (March 2026)

8. google/gemma-4-31B-it — Relevance: MEDIUM

  • Released: 2026-03-11 · Downloads: 8.4M · Likes: 2,535
  • Link: https://huggingface.co/google/gemma-4-31B-it
  • Architecture: 30.7B dense, 60 layers, ~550M vision encoder, Apache 2.0, 256K context, multilingual (140+ langs). Configurable visual token budget (70/140/280/560/1120).
  • Benchmarks: MMMU Pro (Vision) 76.9%, MMLU Pro 85.2%, GPQA Diamond 84.3%.
  • Why MEDIUM: Different architecture family (no Liger/Unsloth path proven for us — see Granite Vision lessons), but an Apache 2.0 dense 31B with strong vision benchmarks is a plausible second bet if Qwen3.6-27B SFT plateaus. The configurable image-token budget could help on the 3,500-hard set, where image complexity varies. Recommend keeping on the watchlist; not an immediate priority.

9. google/gemma-4-26B-A4B-it — Relevance: LOW

  • Released: 2026-03-11 · Downloads: 6.7M · Likes: 890
  • Link: https://huggingface.co/google/gemma-4-26B-A4B-it
  • Architecture: MoE, 4B active per token, multimodal, Apache 2.0.
  • Why LOW: Same family caveat as #8, plus MoE complexity. Skip until #3 (Qwen3.6-35B-A3B) tells us how MoE routing interacts with our SFT+GRPO+GTPO pipeline.

✅ Recommended actions

| Priority | Action | Owner |
| --- | --- | --- |
| 1 | Register Qwen/Qwen3.6-27B in PeakBench; run the 3.5k-hard eval as base. | Myan |
| 2 | If #1 scores ≥ 0.85 weighted, kick off the standard SFT → eval → GRPO → eval → upload-to-HF pipeline. | Myan |
| 3 | Register Qwen/Qwen3.6-35B-A3B in PeakBench; run the 3.5k-hard eval as base. Decide MoE vs. dense based on relative scores. | Myan |
| 4 | Quick 100-sample probe of nvidia/Cosmos-Reason2-32B to see whether its post-training transfers to garment attributes at all. Skip if the base scores ≪ Qwen3-VL-32B-Instruct on our task. | Myan |
| 5 | Watchlist only: gemma-4-31B-it, LFM2.5-VL-450M. Do not start training until the Qwen3.6 path is exhausted. | — |

🚫 Notable absences (searched, none new in window)

  • InternVL — latest is InternVL3.5 from Aug 2025; no InternVL4 yet.
  • MiniCPM-V — latest is MiniCPM-o-4.5 / MiniCPM-V-4.5 from Feb 2026.
  • Phi-4-Multimodal — no new release in window.
  • Florence-3 / Florence-4 — no release.
  • PaliGemma3 — no release.
  • SmolVLM3 / Idefics4 — no release.
  • LLaVA-NeXT-2 / OneVision-2 — no release.
  • Molmo / Moondream — no new base in window.
  • CogVLM3 / GLM-4.5V / DeepSeek-VL3 — no release in window.

Scout report generated by Claude Code on behalf of Myan Sudharsanan (Innovation AI Developer II, Denali AI).
