MR_midtrain_9B_v3

v3 meta-reasoning SFT of Qwen3.5-9B (checkpoint global_step_4500). A single model runs the full meta-reasoning loop: MR (propose exploration directions) → E (execute each direction, emit a summary) → FA (write the final answer). The loop uses custom <direction> / <summary> special tokens (vocab 248320).

  • Architecture: Qwen3_5ForConditionalGeneration. The SFT'd text weights are hosted in the base conditional-generation architecture: they live under text_config, and the vision tower is carried along but unused and frozen, since this is a text-only model. This form loads directly in both vLLM serving and verl Megatron (mbridge) RL. The original text-only Qwen3_5ForCausalLM export remains recoverable from this repo's git history (commit 072c7a3).
  • Base: Qwen/Qwen3.5-9B. Scaffold: inference.mrv3 (single model; per-layer fan-out; termination = an MR step proposes no further directions → separate FA step).
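
The MR → E → FA control flow described above can be sketched roughly as follows. This is not the inference.mrv3 scaffold itself, only an illustration of per-layer fan-out and the termination rule. The prompt layout, the closing </direction> and </summary> tags, the `generate` callback, and the `max_layers` cap are all assumptions made for the example; the card only names the <direction> / <summary> special tokens.

```python
import re
from typing import Callable, List

# Assumption: directions and summaries are delimited by paired tags.
# The model card names the special tokens but not their exact layout.
DIRECTION_RE = re.compile(r"<direction>(.*?)</direction>", re.DOTALL)
SUMMARY_RE = re.compile(r"<summary>(.*?)</summary>", re.DOTALL)


def parse_directions(text: str) -> List[str]:
    """Return every non-empty direction proposed in an MR output."""
    return [d.strip() for d in DIRECTION_RE.findall(text) if d.strip()]


def parse_summary(text: str) -> str:
    """Return the first summary in an E output, or the raw text if none is tagged."""
    match = SUMMARY_RE.search(text)
    return match.group(1).strip() if match else text.strip()


def render_context(problem: str, summaries: List[str]) -> str:
    """Hypothetical prompt layout: the problem followed by all summaries so far."""
    lines = ["Problem: " + problem]
    lines += [f"Summary {i + 1}: {s}" for i, s in enumerate(summaries)]
    return "\n".join(lines)


def run_loop(
    problem: str,
    generate: Callable[[str, str], str],  # (step name, prompt) -> model output
    max_layers: int = 4,  # safety cap, an assumption of this sketch
) -> str:
    summaries: List[str] = []
    for _ in range(max_layers):
        # MR step: propose exploration directions for this layer.
        directions = parse_directions(generate("MR", render_context(problem, summaries)))
        if not directions:
            break  # termination: MR proposed no further directions
        # E step, fanned out per layer: every direction sees the same context.
        context = render_context(problem, summaries)
        layer = [
            parse_summary(generate("E", context + "\nDirection: " + d))
            for d in directions
        ]
        summaries.extend(layer)
    # FA step: a separate call writes the final answer.
    return generate("FA", render_context(problem, summaries))
```

Here `generate` stands in for whatever call reaches the served model; since one model plays all three roles, the step name would only select the prompt template.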

Evaluation (GPT-5.4 rubric judge unless noted)

bench            metric            score
SODA2026         mean              0.478
IMO ProofBench   pass@1 / best@3   0.547 / 0.678 (official Gemini-3.1-Pro judge: 0.490)
physics_papers   pass@1            0.679

Before any v3 RL, the v3 SFT checkpoint already matches or exceeds the best v2 RL checkpoint on all three benches.

Serving / training note

This repo hosts the conditional-generation architecture (Qwen3_5ForConditionalGeneration) directly, for two reasons: the text-only Qwen3_5ForCausalLM export does not load under vLLM 0.19.1's hybrid KV-cache unification, and verl's Megatron/mbridge path does not recognize the text-only qwen3_5_text model type. Both require the conditional-generation form. Sample at temperature 1.0; at 0.6 the E step degenerates into repetition loops.
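
As a minimal sketch of the two points in this note, the helpers below check that a parsed config.json declares the conditional-generation architecture and build a request body with the recommended temperature. The helper names and the `max_tokens` default are illustrative assumptions. The payload follows the OpenAI-style completions format that vLLM's server accepts, and the model id is this repo's name.

```python
import json

EXPECTED_ARCH = "Qwen3_5ForConditionalGeneration"


def is_condgen_config(config: dict) -> bool:
    """True if a parsed config.json lists the conditional-generation architecture."""
    return EXPECTED_ARCH in config.get("architectures", [])


def build_completion_request(
    prompt: str,
    model: str = "HerrHruby/MR_midtrain_9B_v3",
    max_tokens: int = 2048,  # illustrative value, not taken from the model card
) -> dict:
    """Build an OpenAI-style completions payload with the recommended temperature."""
    return {
        "model": model,
        "prompt": prompt,
        "temperature": 1.0,  # per the note above, 0.6 sends the E step into loops
        "max_tokens": max_tokens,
    }


if __name__ == "__main__":
    # Print the JSON body that would be POSTed to a running server.
    print(json.dumps(build_completion_request("Problem: ..."), indent=2))
```

Keeping the temperature inside one request builder means every step of the loop samples with the same setting, so the E step cannot accidentally inherit a lower client-side default.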

Model size: 10B params (Safetensors, tensor type BF16)