arxiv:2605.08632

PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding

Published on May 9

Abstract

PARD-2 is a dual-mode speculative decoding framework that improves LLM inference speed by optimizing draft model training to maximize token acceptance length rather than prediction accuracy.

Speculative decoding accelerates Large Language Model (LLM) inference by using a lightweight draft model to propose candidate tokens that are verified in parallel by the target model. However, existing draft model training objectives are not directly aligned with the inference-time goal of maximizing consecutive token acceptance. To address this issue, we reformulate the draft model optimization objective, shifting the focus from token prediction accuracy to the overall acceptance length. In this paper, we build upon PARD to propose PARD-2, a dual-mode speculative decoding framework with Confidence-Adaptive Token (CAT) optimization. This approach adaptively reweights each token to better align with the verification process. Notably, PARD-2 enables a single draft model to support both target-dependent and target-independent modes. Experiments across diverse models and tasks demonstrate that PARD-2 achieves up to 6.94× lossless acceleration, surpassing EAGLE-3 by 1.9× and PARD by 1.3× on Llama3.1-8B. Our code is available at https://github.com/AMD-AGI/PARD.
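
To make the gap between token prediction accuracy and acceptance length concrete, here is a minimal sketch in Python. It assumes greedy verification, where a drafted token is accepted only if it equals the target model's own choice at that position and everything after the first mismatch is discarded. The function names and the toy token IDs are hypothetical and are not taken from the PARD-2 code; the sketch does not implement CAT optimization itself.

```python
from typing import Sequence


def acceptance_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count leading draft tokens that match the target's tokens.

    Assumes greedy verification: the first mismatch ends acceptance,
    so later matches do not count.
    """
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted += 1
    return accepted


def token_accuracy(draft: Sequence[int], target: Sequence[int]) -> float:
    """Fraction of positions where the draft token equals the target token."""
    if not target:
        return 0.0
    matches = sum(d == t for d, t in zip(draft, target))
    return matches / len(target)


# Toy token IDs (illustrative only).
target_tokens = [5, 9, 2, 7]
draft_a = [5, 9, 2, 0]  # single error at the last position
draft_b = [0, 9, 2, 7]  # single error at the first position

for name, draft in (("draft_a", draft_a), ("draft_b", draft_b)):
    acc = token_accuracy(draft, target_tokens)
    length = acceptance_length(draft, target_tokens)
    print(f"{name}: accuracy={acc:.2f}, acceptance_length={length}")
```

Both drafts have the same per-token accuracy, yet one yields three accepted tokens and the other none. An error at an early position costs more than an error at a late one, which is the kind of mismatch a per-token reweighting scheme aims to address.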


