# Kimi-K2.5 2-bit GSQ Quantization
This is a simulated 2-bit quantized version of moonshotai/Kimi-K2.5, produced using GSQ, a learned post-training quantization method. The model weights are stored in compressed-tensors format and are compatible with vLLM for inference.
**Note (simulated quantization):** The quantization was optimized at 2-bit precision during training, but the resulting weights are serialized in a 4-bit packed integer format (`int32`, 8 values per element) via compressed-tensors. At inference time, vLLM loads and dequantizes from this 4-bit container. The weight values themselves use only 4 distinct levels (matching true 2-bit), but the on-disk and in-memory representation is 4-bit: this checkpoint offers no memory or storage saving beyond INT4.
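To make the packing concrete, here is an illustrative sketch (not the compressed-tensors implementation) of how eight 4-bit codes occupy one `int32`. Even though a 2-bit quantizer only ever emits 4 distinct code values, each code still costs 4 bits in this container:

```python
def pack_int4(codes):
    """Pack 8 values (each 0..15) into a single int32-sized word."""
    assert len(codes) == 8 and all(0 <= c <= 15 for c in codes)
    word = 0
    for i, c in enumerate(codes):
        word |= c << (4 * i)  # 4 bits per code, lowest nibble first
    return word

def unpack_int4(word):
    """Recover the 8 4-bit codes from a packed word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]

# A 2-bit quantizer uses only levels {0, 1, 2, 3}, yet each level
# still occupies a full 4-bit nibble in the packed representation.
codes = [0, 3, 1, 2, 3, 0, 2, 1]
packed = pack_int4(codes)
assert unpack_int4(packed) == codes
```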
## Model Details
| Property | Value |
|---|---|
| Base model | moonshotai/Kimi-K2.5 |
| Architecture | MoE multimodal LLM (DeepSeek-V3-style MoE) |
| Transformer layers | 61 |
| Routed experts | 384 (8 active per token) |
| Hidden size | 7168 |
| Context length | 262,144 tokens (256K) |
| Total parameters | ~547B |
| Quantization | 2-bit GSQ (stored as INT4-packed via compressed-tensors) |
| Quantized layers | Expert FFN weights, layers 1-60 |
| Group size | 128 |
| Calibration dataset | open-thoughts/OpenThoughts-114k |
| Weight format | compressed-tensors, pack-quantized |
| Disk size | ~511 GB |
## Results

### Benchmark Results (lm-evaluation-harness)
| Benchmark | Metric | Baseline (BF16) | GSQ 2-bit | Ξ |
|---|---|---|---|---|
| GSM8K | exact_match (strict) | 94.01 | 92.57 | -1.44 |
| ARC-Challenge | acc_norm | 70.14 | 62.97 | -7.17 |
| ARC-Easy | acc_norm | 88.80 | 85.10 | -3.70 |
| PIQA | acc_norm | 86.29 | 82.37 | -3.92 |
| WinoGrande | acc | 80.82 | 76.95 | -3.87 |
### Perplexity (WikiText-2)
Evaluated on a 128-sample held-out split during quantization, with perplexity measured at intermediate layer checkpoints as quantization progressed:
| Checkpoint | WikiText-2 PPL |
|---|---|
| Dense baseline | 1.734 |
| After layer 6 | 1.734 |
| After layer 12 | 1.733 |
| After layer 24 | 1.733 |
| After layer 36 | 1.735 |
| After layer 48 | 1.741 |
| After layer 60 (final) | 1.749 |
The final 2-bit quantized model retains perplexity within 0.015 of the dense baseline (< 1% relative degradation).
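The relative-degradation figure follows directly from the table; as a reminder, perplexity is the exponential of the mean per-token negative log-likelihood. A minimal check:

```python
import math

def perplexity(nlls):
    """Perplexity = exp(mean negative log-likelihood) over held-out tokens."""
    return math.exp(sum(nlls) / len(nlls))

# Relative degradation from the table above: dense 1.734 -> quantized 1.749.
dense, quantized = 1.734, 1.749
rel = (quantized - dense) / dense
assert rel < 0.01  # under 1% relative degradation, as stated
```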
## Quantization Details
This model was quantized using GSQ, a learned post-training quantization method. Quantization was applied independently to each transformer layer using 4,096 calibration samples of sequence length 4,096 from the OpenThoughts dataset, with group size 128.
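GSQ's learned objective is not reproduced here, but the group-wise 2-bit layout it targets can be sketched with a plain round-to-nearest affine baseline. Function names and the affine min/max scheme are illustrative assumptions, not the GSQ algorithm itself:

```python
def quantize_group_2bit(group):
    """Affine round-to-nearest quantization of one weight group to 4 levels (2-bit).

    Each group (e.g. 128 contiguous weights) gets its own scale and offset;
    codes take values in {0, 1, 2, 3}. A plain baseline, not learned GSQ.
    """
    lo, hi = min(group), max(group)
    scale = (hi - lo) / 3 or 1.0  # 4 levels -> 3 steps; guard constant groups
    codes = [min(3, max(0, round((x - lo) / scale))) for x in group]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Map 2-bit codes back to approximate weight values."""
    return [c * scale + lo for c in codes]

group = [0.5, -1.2, 0.1, 0.9] * 32  # one group of 128 weights
codes, scale, lo = quantize_group_2bit(group)
assert set(codes) <= {0, 1, 2, 3}
```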
Only the MoE expert feed-forward weights (`gate_proj`, `up_proj`, `down_proj`) in layers 1-60 are quantized. The following components are kept in original precision:

- Attention projections (`self_attn`)
- Embeddings and the LM head
- Layer norms
- The shared expert
- Layer 0's dense MLP
- All vision tower and multimodal projector weights
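The selection rule above can be sketched as a module-name filter. The exact parameter paths (`model.layers.{i}.mlp.experts.{e}.*`) are an assumption based on DeepSeek-V3-style naming, not taken from this checkpoint:

```python
import re

# Hypothetical module paths; the selection rule mirrors the list above:
# only routed-expert FFN projections in layers 1-60 are quantized.
EXPERT_PROJ = re.compile(
    r"^model\.layers\.(\d+)\.mlp\.experts\.\d+\.(gate_proj|up_proj|down_proj)$"
)

def should_quantize(name: str) -> bool:
    m = EXPERT_PROJ.match(name)
    return bool(m) and 1 <= int(m.group(1)) <= 60

assert should_quantize("model.layers.5.mlp.experts.17.up_proj")
assert not should_quantize("model.layers.0.mlp.gate_proj")               # layer 0 dense MLP
assert not should_quantize("model.layers.5.self_attn.q_proj")            # attention
assert not should_quantize("model.layers.5.mlp.shared_experts.up_proj")  # shared expert
```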
## Usage
This model requires vLLM for inference. Because Kimi-K2.5 uses a custom model architecture (`kimi_k25`), you must pass `--trust-remote-code`.
While the MoE expert weights are quantized to 2-bit, the attention, embedding, and norm weights remain in bfloat16, so the on-disk size is ~511 GB and the model still requires substantial GPU memory. In our testing, 8× NVIDIA GH200 96 GB GPUs (2 nodes with tensor parallelism 8) are needed for serving.
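As a back-of-envelope sanity check (an estimate, not a measured footprint): under tensor parallelism the ~511 GB of weights shard roughly evenly across the 8 GPUs, which fits under the 85% memory-utilization cap used below while leaving headroom for KV cache and activations:

```python
# Rough per-GPU weight footprint under tensor parallelism (estimate only;
# actual usage also includes KV cache, activations, and framework overhead).
disk_gb = 511          # checkpoint size from the table above
tp = 8                 # tensor-parallel size used in our testing
gpu_gb = 96            # GH200 memory per GPU
util_cap = 0.85        # --gpu-memory-utilization setting

per_gpu = disk_gb / tp                 # ~63.9 GB of weights per GPU
assert per_gpu < gpu_gb * util_cap     # fits under the utilization cap
```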
### Installation

```bash
pip install vllm
```
### Serving with vLLM

```bash
vllm serve daslab-testing/Kimi-K2.5-2bit-GSQ \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --distributed-executor-backend ray \
  --tokenizer-mode hf \
  --mm-encoder-tp-mode data \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 4
```
Flag notes:

- `--tokenizer-mode hf`: Required to prevent garbled output during extended serving sessions (vLLM issue #35718).
- `--mm-encoder-tp-mode data`: Required for Kimi-K2.5's vision encoder; its ViT dimensions are not evenly divisible by the tensor-parallel size, which causes cuBLAS errors without this flag.
- `--max-model-len 4096`: Adjust upward if GPU memory permits; 4096 is what was used during our testing.
- `--distributed-executor-backend ray`: Required for multi-node serving.
### Offline inference with vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="daslab-testing/Kimi-K2.5-2bit-GSQ",
    trust_remote_code=True,
    tensor_parallel_size=8,
    tokenizer_mode="hf",
    mm_encoder_tp_mode="data",
    max_model_len=4096,
    gpu_memory_utilization=0.85,
)

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)
outputs = llm.generate(["Explain the concept of entropy in thermodynamics."], sampling_params)
print(outputs[0].outputs[0].text)
```
### Chat template

Kimi-K2.5 uses its own tokenizer and chat template. Use the tokenizer bundled with this repository:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "daslab-testing/Kimi-K2.5-2bit-GSQ",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```
## Limitations
- This is a research quantization, not a production-ready release. Expect some quality degradation relative to the full-precision model, particularly on tasks requiring precise arithmetic or complex multi-step reasoning.
- Vision/multimodal capabilities have not been evaluated post-quantization (only the language model weights were quantized).
- The model uses a custom architecture; some inference frameworks other than vLLM may not support it without modification.
## License
This model is derived from moonshotai/Kimi-K2.5 and is subject to the same license terms. Please review those terms before use.