Daily Model Scout Report — 2026-05-06


Scope: New VLMs released in the last ~7 days that could move the needle on our 3,500-sample hard eval (current best: qwen3-vl-8b-sft+grpo at weighted_score 0.9131).

Method: Searched HF API for image-text-to-text and visual-question-answering models created after 2026-04-29, plus targeted searches for Qwen3.6, Gemma-4, Cosmos-Reason2, LFM2.5-VL, InternVL, MiniCPM-V, Phi, Florence, PaliGemma, LLaVA, Idefics, Molmo, Moondream. Most "last 7 days" hits are community quantizations / heretic merges — the genuinely interesting candidates are the upstream Apache-2.0 base models from April that we have not yet benchmarked.
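
The search described above boils down to a pipeline-tag filter plus a creation-date cutoff. A minimal sketch of that filter logic, run here against mock records (in practice each record would come from `huggingface_hub`'s `HfApi.list_models`, e.g. `HfApi().list_models(pipeline_tag="image-text-to-text", sort="createdAt", direction=-1, limit=500)`; the `ModelRecord` stand-in and its fields are assumptions, not the actual API payload):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ModelRecord:
    # Stand-in for the fields the filter needs from a hub listing.
    model_id: str
    pipeline_tag: str
    created_at: datetime

SCOUT_TAGS = {"image-text-to-text", "visual-question-answering"}

def scout_window(records, cutoff):
    """Keep models with a scouted pipeline tag created after `cutoff`, newest first."""
    hits = [r for r in records
            if r.pipeline_tag in SCOUT_TAGS and r.created_at > cutoff]
    return sorted(hits, key=lambda r: r.created_at, reverse=True)

# Mock records using IDs and release dates from this report.
cutoff = datetime(2026, 4, 29, tzinfo=timezone.utc)
records = [
    ModelRecord("nvidia/Cosmos-Reason2-32B", "image-text-to-text",
                datetime(2026, 4, 29, 12, tzinfo=timezone.utc)),
    ModelRecord("Qwen/Qwen3.6-27B", "image-text-to-text",
                datetime(2026, 4, 21, tzinfo=timezone.utc)),
]
in_window = scout_window(records, cutoff)
print([r.model_id for r in in_window])  # only the Cosmos release clears the cutoff
```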


🔥 Within strict 7-day window (created after 2026-04-29)

1. nvidia/Cosmos-Reason2-32B — Relevance: MEDIUM

  • Released: 2026-04-29
  • Link: https://huggingface.co/nvidia/Cosmos-Reason2-32B
  • Architecture: Qwen3-VL-32B-Instruct fine-tune (so same family as our current best)
  • Params: 32B dense, BF16, Vision Transformer + Dense Transformer, 256K context
  • License: NVIDIA Open Model License + Apache 2.0 (commercial OK)
  • Why it might help us: Built on the same Qwen3-VL backbone we already SFT+GRPO. NVIDIA reports +2.78 pts on general reasoning vs base Qwen3-VL-32B (75.85% vs 73.07%) and very large gains on Smart Spaces (77.79 vs 47.55) and Self-Driving (70.15 vs 48.08). Tuned for physical-AI / robotics, not garments — but the upgrade path from our current 8B → 32B with the same architecture is straightforward, and the post-training apparently lifted general-purpose vision reasoning too.
  • Why MEDIUM not HIGH: It's specialized for video + robotics CoT. We'd be paying for a lot of capability we don't use. Better choice for "I want a 32B Qwen3-VL backbone with extra reasoning" is to SFT directly on Qwen/Qwen3-VL-32B-Instruct ourselves. Worth a 100-sample probe before deciding.
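
The 100-sample probe suggested above is just a fixed-seed subsample of the hard eval, so different candidate models are always scored on the exact same rows. A minimal sketch (the ID format and helper are illustrative assumptions, not the actual PeakBench harness):

```python
import random

def probe_subset(eval_ids, k=100, seed=0):
    """Deterministically pick k eval IDs; repeated probes of different
    models then score the exact same subset."""
    rng = random.Random(seed)  # fixed seed -> reproducible probe
    return sorted(rng.sample(list(eval_ids), k))

# Assumed stand-in for the 3,500-row hard-eval index.
hard_eval_ids = [f"sample_{i:04d}" for i in range(3500)]
probe = probe_subset(hard_eval_ids)
print(len(probe), probe[:2])
```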

2. Qwen3.6-VL community variants (REAP-pruned) — Relevance: LOW (for now)

  • Examples: keithnull/Qwen3.6-VL-REAP-26B-A3B-GGUF (2026-04-27), mattbucci/Qwen3.6-VL-REAP-26B-A3B-AWQ (2026-04-28), upstream atbender/Qwen3.6-VL-REAP-26B-A3B
  • What it is: Cerebras REAP (Router-Expert Activation Pruning) applied to Qwen/Qwen3.6-35B-A3B, dropping 256→192 routed experts (~25% pruning) → 27B with the vision encoder retained.
  • Why LOW: Community pruning + community quantizations. We should benchmark the official Qwen3.6 base first (see below) before chasing pruned forks.

🌟 Recent base-model releases worth benchmarking (April 2026 — fresh enough to matter)

3. Qwen/Qwen3.6-35B-A3B — Relevance: HIGH

  • Released: 2026-04-15 · Downloads: 3M · Likes: 1,642
  • Link: https://huggingface.co/Qwen/Qwen3.6-35B-A3B
  • Architecture: MoE, 35B total / 3B activated, 256 experts (8 routed + 1 shared), Gated DeltaNet + Gated Attention hybrid, native vision encoder, 262K context (1M w/ YaRN). Apache 2.0.
  • Pipeline tag: image-text-to-text (multimodal native).
  • Why HIGH: This is the natural successor to the Qwen3-VL-8B that's already our SOTA. With only 3B active params per token it should infer at roughly our 2B model's speed while having the representational headroom of a 35B. BF16 weights ≈ 70 GB → fits comfortably on the RTX PRO 6000 98GB. Apache 2.0 means clean licensing for ReLo deployment. Recommend an SFT+GRPO run on the standard 7,672-row apparel-capture-8k-train pipeline and eval on the 3.5k hard set.

4. Qwen/Qwen3.6-27B — Relevance: HIGH

  • Released: 2026-04-21 · Downloads: 1.6M · Likes: 1,148
  • Link: https://huggingface.co/Qwen/Qwen3.6-27B
  • Architecture: Dense 27B, hidden 5120, 64 layers, hybrid Gated DeltaNet + Gated Attention, native vision encoder, 262K context, BF16. Apache 2.0. vLLM v0.19.0+ and SGLang v0.5.10+ supported.
  • Why HIGH: Dense-27B variant of the Qwen3.6-VL line — the closest apples-to-apples scale-up from our current 8B SOTA. Stronger reported general benchmarks than Qwen3.5-27B. BF16 ≈ 54 GB → fits on the RTX PRO 6000 with room to spare for activations and vision tokens. If MoE routing complicates LoRA/GRPO for #3, this is the safe-choice upgrade.
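
The BF16 footprints quoted for items 3 and 4 follow directly from 2 bytes per parameter; a quick sanity check (the 98 GB card size is from this report, and the headroom figure is a rough illustration that ignores CUDA context and framework overhead):

```python
def bf16_weight_gb(params_billion: float) -> float:
    """BF16 stores 2 bytes per parameter; report figures use decimal GB."""
    return params_billion * 1e9 * 2 / 1e9  # simplifies to 2 * params_billion

for name, params_b in [("Qwen3.6-35B-A3B", 35.0), ("Qwen3.6-27B", 27.0)]:
    gb = bf16_weight_gb(params_b)
    headroom = 98.0 - gb  # RTX PRO 6000 capacity quoted in the report
    print(f"{name}: {gb:.0f} GB weights, ~{headroom:.0f} GB left for KV cache / activations")
```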

5. Qwen/Qwen3.6-27B-FP8 — Relevance: MEDIUM

  • Released: 2026-04-21 · Downloads: 2.3M
  • Link: https://huggingface.co/Qwen/Qwen3.6-27B-FP8
  • Why MEDIUM: Official FP8 of #4. Useful as the inference-side comparison point if we land Qwen3.6-27B SFT and want to benchmark FP8 vs NVFP4 quantized variants (we already have the NVFP4 pipeline).

6. Qwen/Qwen3.6-35B-A3B-FP8 — Relevance: MEDIUM

  • Why MEDIUM: Official FP8 of #3. Same inference-side comparison role as #5, but for the MoE variant.

7. LiquidAI/LFM2.5-VL-450M — Relevance: LOW (but worth a probe)

  • Released: 2026-04-08 · Downloads: 49k · Likes: 168
  • Link: https://huggingface.co/LiquidAI/LFM2.5-VL-450M
  • Architecture: 450M total (Liquid backbone + SigLIP2 vision), bbox prediction, function calling, multilingual. License: "other".
  • Why LOW: Way below our quality threshold for a 9-field hard eval — but with bbox prediction baked in, it's interesting for the SAM3.1 defect-segmentation pipeline (project_sellability_sam3_training) as a fast pre-filter, not a replacement for our garment classifier.

🎯 Recent releases NOT in 7-day window but still un-benchmarked (March 2026)

8. google/gemma-4-31B-it — Relevance: MEDIUM

  • Released: 2026-03-11 · Downloads: 8.4M · Likes: 2,535
  • Link: https://huggingface.co/google/gemma-4-31B-it
  • Architecture: 30.7B dense, 60 layers, ~550M vision encoder, Apache 2.0, 256K context, multilingual (140+ langs). Configurable visual token budget (70/140/280/560/1120).
  • Benchmarks: MMMU Pro (Vision) 76.9%, MMLU Pro 85.2%, GPQA Diamond 84.3%.
  • Why MEDIUM: Different architecture family (no Liger/Unsloth path proven for us — see Granite Vision lessons), but an Apache 2.0 dense 31B with strong vision benchmarks is a plausible second bet if Qwen3.6-27B SFT plateaus. The configurable image-token budget could help on the 3,500-hard set, where image complexity varies. Recommend keeping on the watchlist; not an immediate priority.

9. google/gemma-4-26B-A4B-it — Relevance: LOW

  • Released: 2026-03-11 · Downloads: 6.7M · Likes: 890
  • Link: https://huggingface.co/google/gemma-4-26B-A4B-it
  • Architecture: MoE, 4B active per token, multimodal, Apache 2.0.
  • Why LOW: Same family caveat as #8, plus MoE complexity. Skip until #3 (Qwen3.6-35B-A3B) tells us how MoE routing interacts with our SFT+GRPO+GTPO pipeline.

✅ Recommended actions

| Priority | Action | Owner |
| --- | --- | --- |
| 1 | Register Qwen/Qwen3.6-27B in PeakBench; run the 3.5k-hard eval as base. | Myan |
| 2 | If #1 scores ≥ 0.85 weighted, kick off the standard SFT → eval → GRPO → eval → upload-to-HF pipeline. | Myan |
| 3 | Register Qwen/Qwen3.6-35B-A3B in PeakBench; run the 3.5k-hard eval as base. Decide MoE vs. dense based on relative scores. | Myan |
| 4 | Quick 100-sample probe of nvidia/Cosmos-Reason2-32B to see whether its post-training transfers to garment attributes at all. Skip if the base scores ≪ Qwen3-VL-32B-Instruct on our task. | Myan |
| 5 | Watchlist only: gemma-4-31B-it, LFM2.5-VL-450M. Do not start training until the Qwen3.6 path is exhausted. | — |

🚫 Notable absences (searched, none new in window)

  • InternVL — latest is InternVL3.5 from Aug 2025; no InternVL4 yet.
  • MiniCPM-V — latest is MiniCPM-o-4.5 / MiniCPM-V-4.5 from Feb 2026.
  • Phi-4-Multimodal — no new release in window.
  • Florence-3 / Florence-4 — no release.
  • PaliGemma3 — no release.
  • SmolVLM3 / Idefics4 — no release.
  • LLaVA-NeXT-2 / OneVision-2 — no release.
  • Molmo / Moondream — no new base in window.
  • CogVLM3 / GLM-4.5V / DeepSeek-VL3 — no release in window.

Scout report generated by Claude Code on behalf of Myan Sudharsanan (Innovation AI Developer II, Denali AI).
