Thanks to the SpecForge framework for their foundational contributions. Stay tuned for further updates.
Model Overview
GLM-5.1-eagle3 is an advanced and highly specialized draft model meticulously engineered to significantly accelerate the inference process of the GLM-5.1 ecosystem, leveraging the powerful EAGLE3 framework.
Architected upon the robust Llama architecture, this model functions as an exceptionally efficient drafter. It has undergone rigorous training on 1 million high-quality samples sourced from the comprehensive open-perfectblend dataset and some multimodal data. This extensive training ensures precise and strict alignment with the teacher model's distribution, thereby guaranteeing high fidelity and performance.
Performance & Acceleration
The core value of this EAGLE3 model is its ability to predict multiple future tokens that are subsequently verified by the base model. High acceptance lengths indicate significant latency reduction. Continuous future iterations.
Speculative Decoding Configuration:
--speculative-num-steps 3: Configures the number of speculative decoding steps.--speculative-eagle-topk 1: Sets thetop-kvalue for the Eagle draft model during speculative decoding.--speculative-num-draft-tokens 4: Specifies the number of draft tokens generated in each speculative step.
MTP vs Eagle3 Performance Comparison: Batch Sizes (bs) 1 and 32
Despite Eagle3's slightly lower accept length compared to MTP, it achieves higher output throughput across most benchmarks, indicating superior overall efficiency.
Throughput Comparison (token/s)
| Stage | MTP (bs=32) | Eagle3 (bs=32) | MTP (bs=1) | Eagle3 (bs=1) | bs=32 Advantage (MTP/Eagle3) | bs=1 Advantage (MTP/Eagle3) |
|---|---|---|---|---|---|---|
| mtbench | 1127.30 | 1129.48 | 151.25 | 146.83 | -0.2% | +3.0% |
| humaneval | 1292.04 | 1369.01 | 167.58 | 175.04 | -5.6% | -4.3% |
| gsm8k | 682.24 | 686.80 | 134.78 | 133.23 | -0.7% | +1.2% |
| math500 | 1648.05 | 1703.82 | 180.72 | 183.13 | -3.3% | -1.3% |
Accept Length Comparison
| Stage | MTP (bs=32) | Eagle3 (bs=32) | MTP (bs=1) | Eagle3 (bs=1) | bs=32 Difference (MTP-Eagle3) | bs=1 Difference (MTP-Eagle3) |
|---|---|---|---|---|---|---|
| mtbench | 2.93 | 2.78 | 2.86 | 2.70 | +6.9% | +6.0% |
| humaneval | 3.23 | 3.24 | 3.21 | 3.27 | -0.3% | -1.8% |
| gsm8k | 3.14 | 3.00 | 3.14 | 3.00 | +4.7% | +4.4% |
| math500 | 3.40 | 3.35 | 3.41 | 3.37 | +1.5% | +1.3% |
Quick Start
Requirements
- NVIDIA GPU
- CUDA 12.0+
- PyTorch 2.0+
Installation
pip install sglang==0.5.10
Inference with SGLang
Eagle3
python3 -m sglang.launch_server \
--model-path zai-org/GLM-5.1-FP8 \
--tp-size 8 \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path AQ-MedAI/GLM-5.1-Eagle3 \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mem-fraction-static 0.8 \
--host 0.0.0.0 --port 30019 --attention-backend fa3
MTP
python3 -m sglang.launch_server \
--model-path zai-org/GLM-5.1-FP8 \
--tp-size 8 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mem-fraction-static 0.8 \
--host 0.0.0.0 --port 30019 --attention-backend fa3
Citation
If you use this model in your research or application, please cite the following:
@misc{glm5.1eagle3,
title={GLM-5.1-eagle3: Accelerating Instruction Following with EAGLE3},
author={Ant AQ Team},
year={2026},
}
- Downloads last month
- 85