TAPS: Task Aware Proposal Distributions for Speculative Sampling
Abstract
The effectiveness of speculative decoding depends on how well the draft model's training data aligns with the downstream task; specialized drafters perform best when combined at inference time through confidence-based routing rather than simple checkpoint averaging.
Speculative decoding accelerates autoregressive generation by letting a lightweight draft model propose future tokens that a larger target model then verifies in parallel. In practice, however, draft models are usually trained on broad generic corpora, which leaves it unclear how much speculative decoding quality depends on the draft training distribution. We study this question with lightweight HASS and EAGLE-2 drafters trained on MathInstruct, ShareGPT, and mixed-data variants, evaluated on MT-Bench, GSM8K, MATH-500, and SVAMP. Measured by acceptance length, task-specific training yields clear specialization: MathInstruct-trained drafts are strongest on reasoning benchmarks, while ShareGPT-trained drafts are strongest on MT-Bench. Mixed-data training improves robustness, but larger mixtures do not dominate across decoding temperatures. We also study how to combine specialized drafters at inference time. Naive checkpoint averaging performs poorly, whereas confidence-based routing improves over single-domain drafts and merged-tree verification yields the highest acceptance length overall for both backbones. Finally, confidence is a more useful routing signal than entropy: rejected tokens tend to have higher entropy, but confidence produces much clearer benchmark-level routing decisions. These results show that speculative decoding quality depends not only on draft architecture, but also on the match between draft training data and downstream workload, and that specialized drafters are better combined at inference time than in weight space.
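The confidence-based routing the abstract describes can be sketched concretely: score each specialized drafter by the mean top-1 probability of its proposed tokens and route to the most confident one. This is a minimal illustration under my own assumptions, not the paper's implementation; the function names, toy logits, and the plain mean-of-top-1 scoring rule are all mine.

```python
import math

def top1_confidence(logits_per_step):
    """Mean top-1 softmax probability over a sequence of logit vectors."""
    total = 0.0
    for step in logits_per_step:
        m = max(step)                       # subtract max for numerical stability
        exps = [math.exp(x - m) for x in step]
        total += max(exps) / sum(exps)      # top-1 probability at this step
    return total / len(logits_per_step)

def route(drafter_logits):
    """Pick the drafter whose draft tokens have the highest mean confidence."""
    return max(drafter_logits, key=lambda name: top1_confidence(drafter_logits[name]))

# Toy example: the "math" drafter is sharply peaked on this prompt,
# while the "chat" drafter produces a nearly flat distribution.
drafters = {
    "math": [[5.0, 0.1, 0.1], [4.0, 0.2, 0.1]],
    "chat": [[1.0, 0.9, 0.8], [1.1, 1.0, 0.9]],
}
print(route(drafters))  # → math
```

In this sketch routing happens once per request; a per-token or per-chunk variant would recompute the scores as generation proceeds, which is where the temperature sensitivity discussed in the abstract would come into play.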
Community
Task-aware speculative decoding with specialized draft models and inference-time composition (routing and merged trees) to improve acceptance length across domains while preserving the target distribution.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- ConFu: Contemplate the Future for Better Speculative Sampling (2026)
- DFlash: Block Diffusion for Flash Speculative Decoding (2026)
- Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding (2026)
- Training-free Dropout Sampling for Semantic Token Acceptance in Speculative Decoding (2026)
- P-EAGLE: Parallel-Drafting EAGLE with Scalable Training (2026)
- MoE-Spec: Expert Budgeting for Efficient Speculative Decoding (2026)
- SpecForge: A Flexible and Efficient Open-Source Training Framework for Speculative Decoding (2026)
Confidence beating entropy as the routing signal for combining specialized drafters, together with inference-time composition like merged-tree verification, seems to be the real win here. It fits the intuition that the verifier accepts tokens the draft is confident about and doubts the rest. But how is draft confidence calibrated across domains and temperatures? Merged-tree verification delivering the highest acceptance length is neat, but the exact tree construction and pruning thresholds feel crucial and underexplained; the distinction between task-aligned training and inference-time composition is the key one to keep in mind when reading the method. A small follow-up ablation on routing thresholds under skewed data would help confirm that the gains generalize beyond the balanced benchmarks.
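The confidence-vs-entropy distinction is easy to ground: both are per-token statistics of the draft distribution, but confidence (the top-1 probability) is bounded and directly comparable across steps, while entropy also reflects the shape of the tail, which the abstract's observation about rejected tokens suggests makes it a noisier routing score. A minimal sketch; the function names and toy logits are mine, not from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a single token's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def confidence(logits):
    """Top-1 probability: bounded in (0, 1], comparable across steps."""
    return max(softmax(logits))

def entropy(logits):
    """Shannon entropy in nats: sensitive to the whole tail of the distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0)

peaked = [6.0, 0.5, 0.3]   # drafter is sure of one token
flat   = [1.0, 0.9, 1.1]   # drafter is hedging across several tokens

print(confidence(peaked) > confidence(flat))  # True
print(entropy(flat) > entropy(peaked))        # True
```

A routing rule built on `confidence` needs only a single threshold per drafter, whereas an entropy-based rule must additionally account for vocabulary size and temperature, which may be part of why confidence produces the clearer benchmark-level decisions reported in the paper.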