Reliable Chain-of-Thought via Prefix Consistency
Abstract
Prefix consistency uses answer reproduction rates under trace regeneration to weight candidate responses, achieving high accuracy with significantly fewer tokens than standard majority voting.
Large language models often improve accuracy on reasoning tasks by sampling multiple Chain-of-Thought (CoT) traces and aggregating them with majority voting (MV), a test-time technique called self-consistency. When we truncate a CoT partway through and regenerate the remainder, we observe that traces ending in the correct answer reproduce their original answer more often than traces ending in a wrong one. We use this difference as a reliability signal, which we call prefix consistency: each candidate answer is weighted by how often it reappears under regeneration. The method requires no access to token log-probabilities and no self-rating prompts. Across five reasoning models and four math and science benchmarks, prefix consistency is the best correctness predictor in most settings, and reweighting votes by it reaches standard MV's plateau accuracy with up to 21x fewer tokens (median 4.6x). Our code is available at https://github.com/naoto-iwase/prefix-consistency.
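To make the procedure concrete, here is a minimal sketch of prefix-consistency-weighted voting as the abstract describes it: truncate each sampled trace, regenerate the remainder several times, score the trace by how often its original answer reappears, and use those scores as vote weights. The `generate(prompt) -> (trace, answer)` sampler, `truncate_frac`, and `n_regen` are illustrative assumptions, not the paper's actual API or hyperparameters.

```python
from collections import defaultdict

def prefix_consistency(question, trace, answer, generate,
                       truncate_frac=0.5, n_regen=4):
    """Score one CoT trace by how often regenerating from a prefix
    of it reproduces the trace's original answer.

    `generate` is a hypothetical sampler returning (trace, answer);
    truncation point and regeneration count are illustrative choices.
    """
    prefix = trace[: int(len(trace) * truncate_frac)]
    hits = 0
    for _ in range(n_regen):
        # Continue sampling from the truncated trace and compare answers.
        _, new_answer = generate(question + prefix)
        hits += (new_answer == answer)
    return hits / n_regen

def prefix_consistency_vote(question, generate, n_samples=8):
    """Sample CoT traces, weight each candidate answer by the prefix
    consistency of the traces that produced it, and return the
    highest-weighted answer (standard MV would weight each trace by 1).
    """
    weights = defaultdict(float)
    for _ in range(n_samples):
        trace, answer = generate(question)
        weights[answer] += prefix_consistency(question, trace, answer, generate)
    return max(weights, key=weights.get)
```

In this sketch the regeneration cost scales with `n_samples * n_regen`; the paper's reported token savings come from reaching plateau accuracy with far fewer total samples than standard MV, so the actual budget allocation between initial traces and regenerations is an implementation detail left to the released code.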
Community
TL;DR: Correct Chain-of-Thought traces reproduce their answer under prefix regeneration more often than wrong ones, and weighting majority voting by this prefix consistency reaches plateau accuracy with up to 21x fewer tokens (median 4.6x).
High-ROI. I like it.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- The Detection-Extraction Gap: Models Know the Answer Before They Can Say It (2026)
- Filtered Reasoning Score: Evaluating Reasoning Quality on a Model's Most-Confident Traces (2026)
- GCoT-Decoding: Unlocking Deep Reasoning Paths for Universal Question Answering (2026)
- How Uncertainty Estimation Scales with Sampling in Reasoning Models (2026)
- Efficient Test-Time Scaling via Temporal Reasoning Aggregation (2026)
- Confidence-Aware Alignment Makes Reasoning LLMs More Reliable (2026)
- SCATR: Simple Calibrated Test-Time Ranking (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend