
GRPO Q&A Training: PleIAs/SYNTH (German)


Setup

pip install torch transformers trl datasets peft rouge-score

Reward Function

The combined reward is a weighted blend of three signals:

| Component | Weight | Description |
|---|---|---|
| ROUGE-L | 0.30 | N-gram overlap with the reference answer |
| Length similarity | 0.20 | Ratio of response length to reference length |
| LLM judge | 0.50 | Same model evaluates correctness (0–1 score) |
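The blend above can be sketched as a small helper. This is an illustrative reconstruction, not the trainer's actual code: the component scores are assumed to already be normalised to [0, 1], and the `length_similarity` definition (shorter length over longer length) is one plausible reading of the table row.

```python
def length_similarity(response_len: int, reference_len: int) -> float:
    # Assumed definition: ratio of shorter to longer length, in [0, 1].
    if response_len == 0 or reference_len == 0:
        return 0.0
    return min(response_len, reference_len) / max(response_len, reference_len)

def combined_reward(rouge_l: float, len_sim: float, judge: float) -> float:
    # Weights taken directly from the table above.
    return 0.30 * rouge_l + 0.20 * len_sim + 0.50 * judge
```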

Training Configuration

  • LoRA: rank 32, alpha 64, all-linear, dropout 0.1
  • Optimizer: AdamW fused, β₂=0.99, weight decay 0.1
  • LR schedule: Cosine with 10% warmup
  • Generations per prompt: 4 (GRPO rollouts)
  • Max completion length: 160 tokens (dynamic → 256 at step 500)
  • Precision: FP16
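The configuration above roughly corresponds to the following peft/TRL setup. Parameter names follow the public `LoraConfig` and `GRPOConfig` APIs; `output_dir` is a placeholder, and this is a sketch of the run's settings rather than its actual script.

```python
from peft import LoraConfig
from trl import GRPOConfig

lora_config = LoraConfig(
    r=32, lora_alpha=64, target_modules="all-linear", lora_dropout=0.1,
)

training_args = GRPOConfig(
    output_dir="grpo-synth-de",      # hypothetical path
    optim="adamw_torch_fused",
    adam_beta2=0.99,
    weight_decay=0.1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_generations=4,               # GRPO rollouts per prompt
    max_completion_length=160,
    fp16=True,
)
```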

Observed Training Behaviour (Current Run)

  • Reward improved from ~0.33 → ~0.40 in the first 1,000 steps, then plateaued for the remaining ~17,000 steps.
  • Entropy collapsed from ~1.5 → ~0.2, indicating the policy became overly deterministic early on.
  • completions/clipped_ratio is 1.0 throughout: the model never generates an EOS token and always fills the full 160-token budget.
  • GRPO clip ratios are consistently 0: the policy ratio stays inside the clip region, and with near-zero advantages the resulting updates are negligible.

Scope for Improvement

1. EOS Generation Issue

The model fills the entire completion window on every generation, never learning to stop naturally. This means the length reward is always saturated and provides no gradient signal.

Solutions:

  • Add an explicit EOS/termination reward: give a bonus (e.g. +0.1) when the completion ends with an EOS token and a penalty when it doesn't.
  • Alternatively, add a brevity penalty: subtract a small amount proportional to len(response) / max_len to discourage unnecessary length.
  • Lower max_completion_length significantly (e.g. 64 or 96) to force the model to be concise before scaling up.
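The first two suggestions can be combined into a single shaping term. This is a sketch, not the run's reward code: the +0.1 bonus and the brevity weight are the example values from the bullets above and would need tuning.

```python
def eos_and_brevity_reward(response_ids, eos_token_id, max_len,
                           eos_bonus=0.1, brevity_weight=0.1):
    """Bonus for ending with EOS, penalty otherwise, plus a mild length penalty.

    `response_ids` is the generated token-id sequence; constants are the
    illustrative values suggested above.
    """
    ends_with_eos = bool(response_ids) and response_ids[-1] == eos_token_id
    reward = eos_bonus if ends_with_eos else -eos_bonus
    # Brevity penalty proportional to len(response) / max_len.
    reward -= brevity_weight * (len(response_ids) / max_len)
    return reward
```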

2. Replace self-referential LLM judge with a frozen model

The current llm_judge_reward_batch uses the model being trained as its own judge (50% of the reward signal). As the policy drifts, the reward signal becomes unreliable and can lead to reward hacking β€” the model learns to fool itself rather than improve quality.

Solution: Load a separate, frozen judge model at startup and keep it frozen throughout.
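One way to decouple the judge, sketched below: load a second model once at startup with `transformers` and freeze it (the loading lines are shown as comments because the judge model name is a placeholder), and parse its textual score defensively so a malformed reply cannot inject garbage into the reward.

```python
import re

# At startup (sketch; judge model name is a placeholder):
#   judge = AutoModelForCausalLM.from_pretrained("frozen-judge-model").eval()
#   judge.requires_grad_(False)   # never updated during training

def parse_judge_score(judge_output: str) -> float:
    """Extract a 0-1 score from the judge's free-text reply; 0.0 on failure."""
    match = re.search(r"\d+(?:\.\d+)?", judge_output)
    if not match:
        return 0.0
    score = float(match.group(0))
    # Tolerate judges answering on a 0-10 or 0-100 scale.
    while score > 1.0:
        score /= 10.0
    return score
```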


3. Add an entropy regularization bonus

The entropy collapsed from ~1.5 to ~0.2 very early, causing the policy to become overly deterministic. GRPO supports an entropy coefficient to counteract this.

Solution: not yet settled. Check whether the installed TRL version exposes an entropy-bonus coefficient on GRPOConfig; if it does not, raising the rollout sampling temperature is a simpler lever to keep exploration alive.
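Whatever mechanism is used, the collapse itself is easy to monitor. A minimal sketch of the quantity being tracked, per next-token distribution:

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution.

    A peaked (deterministic) distribution approaches 0; a uniform one
    approaches log(vocab_size), so a slide from ~1.5 toward ~0.2 is the
    collapse described above.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```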

4. Increase number of generations per prompt

Currently num_generations=4. GRPO relies on variance within the group to produce a learning signal. With only 4 rollouts, a low-variance reward (as seen: reward_std often < 0.01) means the advantage estimates are near zero and updates are negligible.

Increase to 8 or 16, after checking that the GPU on the Lenovo VM has enough memory for the larger rollout batch.
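Why the group size matters can be seen from the group-normalised advantage GRPO computes. This sketch uses the standard (reward − group mean) / group std form; with identical rewards in the group, every advantage is zero and the update vanishes, which is the failure mode observed here.

```python
def group_advantages(rewards, eps=1e-8):
    """Group-normalised advantages for one prompt's rollouts (sketch)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Larger groups make it more likely that at least some rollouts differ in reward, producing non-zero advantages.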


5. Lower the learning rate

The initial LR of 5e-5 might be high for a 7B model fine-tuned with RL. A lower peak LR reduces the risk of entropy collapse and reward instability.
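For reference, the warmup-plus-cosine schedule in use looks like this; the `peak_lr=1e-5` default is an example of a lower peak than the current 5e-5, not a value taken from the run.

```python
import math

def lr_at(step, total_steps, peak_lr=1e-5, warmup_frac=0.1):
    """Linear warmup for the first warmup_frac of steps, cosine decay after."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```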


6. Fix or remove AdjustContextLengthCallback

The callback sets args.max_completion_length = 256 at step 500, but throughout the entire run completions/max_length stays at 160, so the attribute mutation apparently does not propagate to TRL's generation loop. Either verify that it works, or remove it and set a fixed length.
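The simplest fix, sketched here, is to drop the callback entirely and pin the length in the config from step 0 (`output_dir` is a placeholder; other arguments omitted):

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="grpo-synth-de",      # hypothetical path
    max_completion_length=256,       # fixed from the start; no mid-run mutation
)
```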


7. Track ROUGE and LLM judge scores separately

Currently only combined_reward is logged. Breaking it out makes it much easier to diagnose which reward component is driving behaviour.
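A sketch of what the broken-out metrics could look like; the key names are illustrative, and the dict can be passed to `trainer.log(...)` or a wandb run. The weights are the ones from the reward table.

```python
def reward_components_log(rouge_l, len_sim, judge):
    """Per-component reward metrics alongside the blend (illustrative keys)."""
    return {
        "reward/rouge_l": rouge_l,
        "reward/length_sim": len_sim,
        "reward/llm_judge": judge,
        "reward/combined": 0.30 * rouge_l + 0.20 * len_sim + 0.50 * judge,
    }
```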


8. Add a periodic qualitative sample log

Log a few decoded completions every N steps to inspect what the model is actually generating:
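A minimal sketch, assuming decoded strings are already available (e.g. inside the reward function, where TRL passes prompts and completions as text); `every` and `k` are arbitrary choices.

```python
def log_samples(step, prompts, completions, every=200, k=2):
    """Print the first k prompt/completion pairs every `every` steps."""
    if step % every != 0:
        return []
    lines = []
    for prompt, completion in list(zip(prompts, completions))[:k]:
        lines.append(f"[step {step}] PROMPT: {prompt!r}\n  -> {completion!r}")
    for line in lines:
        print(line)
    return lines
```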
