Mathstral-7B · Mismatched × Wrong drafts

Headline model — the mismatched-wrong configuration from the paper.

LoRA adapter for mistralai/Mathstral-7B-v0.1, trained with Dr. GRPO as the Mismatched × Wrong drafts condition in "Weak-to-Strong Elicitation via Mismatched Wrong Drafts" (Wei Deng, arXiv:2605.17314).

Base model: mistralai/Mathstral-7B-v0.1 (Apache-2.0)
Draft model: Qwen/Qwen2.5-Math-1.5B (writes the training-time draft)
Adapter: LoRA r=16, α=32 (167 MB), released at global step 2000
Training data: config mismatched_wrong of hugruby/mismatched-wrong-drafts — 8,888 Level 3–5 MATH problems (MATH-500 held out)
License: Apache-2.0

How to use

This is a LoRA adapter — load it on top of the base model.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE    = "mistralai/Mathstral-7B-v0.1"
ADAPTER = "hugruby/mathstral-7b-mismatched-wrong-drafts"

tok   = AutoTokenizer.from_pretrained(ADAPTER)
model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto"),
    ADAPTER,
).eval()

problem = "If $x+y=6$ and $xy=5$, find $x^2+y^2$."
gen = dict(max_new_tokens=4096, do_sample=False)

# CANONICAL — the plain draft-free prompt the model was trained and evaluated on (no [INST]):
PROMPT = (
    "Problem: " + problem + "\n\n"
    "Thinking: N/A\n\n"
    "The thinking section may contain errors. Solve the math problem step by step. "
    "Write your own correct solution. Put your final answer within \\boxed{}.\n\n"
    "Correct Solution:"
)
ids = tok(PROMPT, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, **gen)[0][ids.input_ids.shape[1]:], skip_special_tokens=True))

Optional: the `[INST]` chat format (out-of-distribution)

The shipped chat_template.jinja is Mathstral's original [INST] chat template. This adapter was not trained in that format, so apply_chat_template(...) is out-of-distribution and generally underperforms the plain prompt above — it is included only so you can A/B both:

ids = tok.apply_chat_template(
    [{"role": "user",
      "content": problem + "\n\nPlease reason step by step, and put your final answer within \\boxed{}."}],
    add_generation_prompt=True, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(ids, **gen)[0][ids.shape[1]:], skip_special_tokens=True))
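
Both prompts above ask for the final answer inside \boxed{}, so a small helper for pulling that answer out of the decoded text can be handy when comparing the two formats. The following is a minimal sketch, not the paper's grader: the function name extract_boxed and the sample string are illustrative assumptions, and it only does brace matching on the last \boxed{...} occurrence.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent or unbalanced."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth, out = 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:          # closing brace of \boxed{...}
                return "".join(out)
        out.append(ch)
        i += 1
    return None                     # ran out of text before the braces balanced


# Hypothetical model output for the example problem (x+y=6, xy=5)
sample = "So $x^2+y^2 = (x+y)^2 - 2xy = 36 - 10 = \\boxed{26}$."
print(extract_boxed(sample))  # 26
```

Nested braces such as \boxed{\frac{1}{2}} are kept intact, and a truncated generation that never closes the box yields None rather than a partial answer.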

How it was trained

Trained with Dr. GRPO (loss_type=dr_grpo, scale_rewards=False) using TRL's GRPOTrainer on top of Unsloth's FastLanguageModel, on the mismatched_wrong data config. The reward is the binary mathematically_quasi_correct check; the correction-bonus, copy-penalty, and corrupt-penalty terms are all set to 0, so no shaping is added on top of it.
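
To make the effect of scale_rewards=False concrete, here is a minimal sketch of group-relative advantages with a binary reward: each completion's advantage is its reward minus the mean reward of its group, with no division by the group's standard deviation. The function name group_advantages and the example group are illustrative assumptions, not taken from the training code.

```python
def group_advantages(rewards):
    """Mean-centered advantages for one prompt's group of completions (no std scaling)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Hypothetical group of 16 completions for one problem, 4 of them judged correct
rewards = [1.0] * 4 + [0.0] * 12
adv = group_advantages(rewards)
print(adv[0], adv[-1])  # 0.75 -0.25

# A group where every completion gets the same reward contributes no signal
print(group_advantages([0.0] * 16) == [0.0] * 16)  # True
```

With a pure binary reward, a problem only produces a gradient signal when its 16 completions disagree on correctness.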

Training command:

python scripts/train.py \
  --model mistralai/Mathstral-7B-v0.1 \
  --dataset-path data/mismatched_wrong \
  --output-dir outputs/mismatched_wrong \
  --max-steps 2222 \
  --gradient-accumulation-steps 4 \
  --max-completion-length 4096 \
  --max-seq-length 7168 \
  --max-prompt-tokens 3072 \
  --learning-rate 5e-6 --lr-scheduler-type constant \
  --beta 0 \
  --correction-bonus 0.0 --copy-penalty 0.0 --corrupt-penalty 0.0 \
  --adam-beta2 0.99 \
  --save-steps 50 --gpu-mem-util 0.5

Hyperparameter	Value
Base model	`mistralai/Mathstral-7B-v0.1`
Method	Dr. GRPO (`loss_type=dr_grpo`, `scale_rewards=False`)
LoRA rank / alpha	r = 16, α = 32 → scaling γ = α/r = 2
LoRA targets / dropout	`q,k,v,o,gate,up,down` (7 projections) / 0.0
KL coefficient β	0
Reward bonuses	correction 0, copy-penalty 0, corrupt-penalty 0
Generations per prompt	16
Per-device batch	1
Gradient accumulation	4 → 4 problems × 16 = 64 completions/step
Learning rate	5e-6, constant schedule
Adam β₂	0.99
Max completion length	4096
Max sequence length	7168
Max prompt tokens	3072
Max steps	2222
Released checkpoint	global step 2000 (epoch 0.900)
Random seed	42
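
The derived figures in the table can be checked with a few lines of arithmetic. This is only a sanity-check sketch using the values listed above; the variable names are illustrative, and it assumes each optimizer step consumes the 4 problems stated in the gradient-accumulation row.

```python
generations_per_prompt = 16
problems_per_step = 4          # from the gradient-accumulation row
dataset_size = 8888
released_step = 2000
max_steps = 2222
lora_r, lora_alpha = 16, 32

completions_per_step = problems_per_step * generations_per_prompt
epoch_at_release = released_step * problems_per_step / dataset_size
epoch_at_max_steps = max_steps * problems_per_step / dataset_size
lora_scaling = lora_alpha / lora_r

print(completions_per_step)        # 64
print(round(epoch_at_release, 3))  # 0.9
print(epoch_at_max_steps)          # 1.0
print(lora_scaling)                # 2.0
```

Under that assumption, the 2222-step budget corresponds to one pass over the 8,888 problems, and the released step-2000 checkpoint sits at about 0.9 of that pass.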

Length budgets across all four variants:

Variant	max-seq-length	max-completion	max-prompt-tokens
mismatched-wrong	7168	4096	3072
matched-wrong	7168	4096	3072
no-draft	7168	4096	disabled (equivalent to 3072: every no-draft prompt is short, so nothing would be truncated)
mismatched-correct	8192	4096	disabled

For a strict apples-to-apples comparison, mismatched-correct should have used --max-seq-length 7168 and --max-prompt-tokens 3072 like the other three variants; using 8,192 with the max-prompt-tokens cap left off was an omission. The effect should be negligible: only 6 of the 8,888 prompts exceed 3,072 tokens (the longest is 3,317), and for the other 8,882 the run is identical to a 7,168 / 3,072 setup. For those 6, the prompt is left untruncated, but the 4,096-token max-completion length still applies, so the total sequence length runs only slightly past 7,168 (at most 3,317 + 4,096 = 7,413, well under 8,192).

To train an exactly matched version yourself, change --max-seq-length 8192 to 7168 and add --max-prompt-tokens 3072 for mismatched-correct. For the no-draft variant, it is also better to add --max-prompt-tokens 3072 explicitly.
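
The length-budget arithmetic above can be reproduced directly. This is a small sketch using only the numbers quoted in this section; the constant names are illustrative.

```python
MAX_COMPLETION = 4096
PROMPT_CAP = 3072
LONGEST_PROMPT = 3317      # longest of the 6 prompts above the cap
SEQ_LIMIT_USED = 8192      # max-seq-length of the mismatched-correct run
N_PROMPTS, N_OVER_CAP = 8888, 6

capped_total = PROMPT_CAP + MAX_COMPLETION        # budget of the other three variants
worst_case_total = LONGEST_PROMPT + MAX_COMPLETION

print(capped_total)                         # 7168
print(worst_case_total)                     # 7413
print(worst_case_total <= SEQ_LIMIT_USED)   # True
print(f"{N_OVER_CAP / N_PROMPTS:.4%}")      # 0.0675%
```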

Files

adapter_model.safetensors, adapter_config.json — the LoRA adapter (load with PEFT on the base model)
tokenizer.json, tokenizer.model, tokenizer_config.json, special_tokens_map.json — tokenizer
chat_template.jinja — Mathstral's [INST] template (see the out-of-distribution note above)

Citation

@article{deng2026mismatched,
  title   = {Weak-to-Strong Elicitation via Mismatched Wrong Drafts},
  author  = {Deng, Wei},
  journal = {arXiv preprint arXiv:2605.17314},
  year    = {2026},
  url     = {https://arxiv.org/abs/2605.17314}
}

License

Apache-2.0. The base model (Mathstral-7B-v0.1) and the draft model (Qwen2.5-Math-1.5B) are both Apache-2.0.
