Delphi-25B-SimpleRL-Math

GRPO RL fine-tune of the Marin Delphi-25B base model (marin-community/delphi-1e23-25Bparams-628Btokens, Qwen3 architecture, compute-optimal 1e23-FLOP point) using the SimpleRL-Zoo zero-RL recipe on hard competition math (hkust-nlp simplelr_qwen_level3to5, MATH levels 3-5).

This is the step-160 checkpoint (20 epochs). Training reward on the hard-math distribution rose from ~4% to ~29%.

Recipe

GRPO, rule-based boxed-answer reward (math_verify), kl_loss_coef=1e-3, rollout.n=8, lr 5e-7, top_p 1.0.
Rollout temperature 0.5 (not the recipe default of 1.0 — a sweep showed the reward signal collapses above ~0.5 for this weak base on hard math).
Seeded with a 3-shot prompt (a base model needs demonstrations, not a bare instruction). The model expects this exact prompt format and stops at <|im_end|>.

Note: the Delphi tokenizer is Llama-3 based, so <|im_start|>/<|im_end|> are literal text, not special tokens. Feed them as raw strings and stop generation at the <|im_end|> string.

Prompt format (fixed 3-shot prefix, then your problem)

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
A positive integer less than 100 leaves remainder 1 when divided by 6 and remainder 4 when divided by 7. Find the sum of all such integers.
Please reason step by step, and put your final answer within \boxed{}.<|im_end|>
<|im_start|>assistant
We need x ≡ 1 (mod 6) and x ≡ 4 (mod 7). The numbers that are 1 mod 6 are 1, 7, 13, 19, 25, ...; the smallest of these that is also 4 mod 7 is 25, since 25 = 6·4+1 and 25 = 7·3+4. Solutions repeat every lcm(6,7) = 42, so the values below 100 are 25 and 67. Their sum is 25 + 67 = 92.
The final answer is \boxed{92}.<|im_end|>
<|im_start|>user
A club has 4 times as many seniors as juniors and twice as many sophomores as juniors. If there are 42 members in total and only these three groups, how many seniors are there?
Please reason step by step, and put your final answer within \boxed{}.<|im_end|>
<|im_start|>assistant
Let the number of juniors be x. Then the number of seniors is 4x and the number of sophomores is 2x. The total is x + 4x + 2x = 7x = 42, so x = 6. The number of seniors is 4x = 24.
The final answer is \boxed{24}.<|im_end|>
<|im_start|>user
A right square pyramid has base edge 10 and lateral edge 13. Find the total area of its four triangular faces.
Please reason step by step, and put your final answer within \boxed{}.<|im_end|>
<|im_start|>assistant
The slant height runs from the apex to the midpoint of a base edge, not to a corner. Half the base edge is 5 and the lateral edge is 13, so the slant height is √(13² − 5²) = √(169 − 25) = √144 = 12. Each triangular face has area ½ · 10 · 12 = 60, and there are four faces, so the total area is 4 · 60 = 240.
The final answer is \boxed{240}.<|im_end|><|im_start|>user
{your problem}
Please reason step by step, and put your final answer within \boxed{}.<|im_end|>
<|im_start|>assistant

Example problems (illustrative of the reliable type; not guaranteed)

Append each to the 3-shot prefix above. Sampling: greedy, stop=<|im_end|>.

Problem	Answer
What is the remainder when $2^{10}$ is divided by 7?	2
A right square pyramid has base edge 6 and lateral edge 5. Find the total area of its four triangular faces.	48
A right triangle has legs 9 and 12. What is the length of the hypotenuse?	15
What is 15% of 80?	12
A rectangle has length 8 and width 5. What is its area?	40

Capability notes (honest)

RL gave a weak base genuine but narrow competence, and reasoning in the correct shape (steps -> single \boxed{} answer -> clean stop). It is reliable on problems solvable by a single operation or one known formula with small/obvious inputs (e.g. a Pythagorean step, a short modular cycle, a percentage). It is unreliable whenever it must (a) track/combine several quantities (multi-term word problems) or (b) discover something, such as a non-trivial factorization — in these cases it tends to confabulate a relation or factorization and not check it. ~29% on hard (level 3-5) MATH reflects this. It is a demonstration that the SimpleRL recipe works mechanically on a weak base; it is not a strong general math model.

Downloads last month: 39

Safetensors

Model size

25B params

Tensor type

F32

Video Preview

Reinforcement Learning

Model tree for WillHeld/Delphi-25B-SimpleRL-Math

Base model

marin-community/delphi-1e23-25Bparams-628Btokens

Finetuned

(1)

this model

Quantizations

2 models