Kimi-K2.5: 2-bit GSQ Quantization

This is a simulated 2-bit quantized version of moonshotai/Kimi-K2.5, produced using GSQ, a learned post-training quantization method. The model weights are stored in compressed-tensors format and are compatible with vLLM for inference.

Note on simulated quantization: The quantization was optimized at 2-bit precision during training, but the resulting weights are serialized into a 4-bit packed integer format (int32, 8 values per element) via compressed-tensors. At inference time, vLLM loads and dequantizes from this 4-bit container. The weight values themselves take only 4 distinct levels (matching true 2-bit), but the on-disk and in-memory representation is 4-bit; this checkpoint offers no memory or storage saving beyond INT4.
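The packing scheme can be illustrated with a small NumPy sketch. This assumes the usual pack-quantized layout of 8 unsigned 4-bit nibbles per int32 (little-nibble-first order is an assumption here, not taken from this checkpoint); the point is that a "2-bit" weight only ever occupies 4 of the 16 available int4 codes:

```python
import numpy as np

def pack_int4(vals: np.ndarray) -> np.ndarray:
    """Pack unsigned 4-bit values (0..15) into int32, 8 values per element."""
    assert vals.size % 8 == 0
    v = vals.reshape(-1, 8).astype(np.uint32)
    packed = np.zeros(v.shape[0], dtype=np.uint32)
    for i in range(8):
        packed |= v[:, i] << (4 * i)   # nibble i of each int32
    return packed.view(np.int32)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    p = packed.view(np.uint32)
    return np.stack([(p >> (4 * i)) & 0xF for i in range(8)], axis=1).reshape(-1)

# A simulated 2-bit weight uses only 4 distinct codes out of the 16 that
# the int4 container can represent (the concrete levels are illustrative).
codes = np.random.default_rng(0).choice([0, 5, 10, 15], size=64).astype(np.uint8)
packed = pack_int4(codes)
restored = unpack_int4(packed)
```

The container is lossless for the codes, which is why the checkpoint behaves like true 2-bit numerically while costing INT4 storage.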

Model Details

Property             Value
Base model           moonshotai/Kimi-K2.5
Architecture         MoE multimodal LLM (DeepSeek-V3-style MoE)
Transformer layers   61
Routed experts       384 (8 active per token)
Hidden size          7168
Context length       262,144 tokens (256K)
Total parameters     ~547B
Quantization         2-bit GSQ (stored as INT4-packed via compressed-tensors)
Quantized layers     Expert FFN weights, layers 1–60
Group size           128
Calibration dataset  open-thoughts/OpenThoughts-114k
Weight format        compressed-tensors, pack-quantized
Disk size            ~511 GB

Results

Benchmark Results (lm-evaluation-harness)

Benchmark      Metric                Baseline (BF16)   GSQ 2-bit   Δ
GSM8K          exact_match (strict)  94.01             92.57       -1.44
ARC-Challenge  acc_norm              70.14             62.97       -7.17
ARC-Easy       acc_norm              88.80             85.10       -3.70
PIQA           acc_norm              86.29             82.37       -3.92
WinoGrande     acc                   80.82             76.95       -3.87

Perplexity (WikiText-2)

Evaluated on a 128-sample held-out split during quantization, measured at intermediate layer checkpoints as quantization progressed:

Checkpoint              WikiText-2 PPL
Dense baseline          1.734
After layer 6           1.734
After layer 12          1.733
After layer 24          1.733
After layer 36          1.735
After layer 48          1.741
After layer 60 (final)  1.749

The final 2-bit quantized model retains perplexity within 0.015 of the dense baseline (< 1% relative degradation).
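Since perplexity is the exponential of the mean token-level negative log-likelihood, the quoted gap corresponds to a very small shift in average NLL. A quick arithmetic check using the table's own numbers:

```python
import math

ppl_dense, ppl_quant = 1.734, 1.749

# Mean negative log-likelihood implied by each perplexity.
nll_dense = math.log(ppl_dense)
nll_quant = math.log(ppl_quant)
nll_gap = nll_quant - nll_dense            # ~0.0086 nats per token

# Relative perplexity degradation, matching the "< 1%" claim.
rel_degradation = (ppl_quant - ppl_dense) / ppl_dense
```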

Quantization Details

This model was quantized using GSQ, a learned post-training quantization method. Quantization was applied independently to each transformer layer using 4,096 calibration samples of sequence length 4,096 from the OpenThoughts dataset, with group size 128.
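GSQ learns its quantization grid, but the group-wise mechanics can be sketched with a plain round-to-nearest asymmetric 2-bit quantizer at group size 128. This is a simplification for illustration only; the actual method optimizes the levels rather than using min/max RTN:

```python
import numpy as np

def quantize_2bit_groupwise(w: np.ndarray, group_size: int = 128):
    """Round-to-nearest asymmetric 2-bit quantization per group of weights.

    Returns integer codes in {0,1,2,3} plus a per-group scale and zero-point.
    """
    g = w.reshape(-1, group_size)
    wmin = g.min(axis=1, keepdims=True)
    wmax = g.max(axis=1, keepdims=True)
    scale = (wmax - wmin) / 3.0              # 2 bits -> 4 levels -> 3 steps
    scale[scale == 0] = 1.0                  # guard constant groups
    codes = np.clip(np.round((g - wmin) / scale), 0, 3).astype(np.uint8)
    return codes, scale, wmin

def dequantize(codes, scale, zero):
    return codes.astype(np.float32) * scale + zero

rng = np.random.default_rng(0)
w = rng.standard_normal(128 * 4).astype(np.float32)
codes, scale, zero = quantize_2bit_groupwise(w)
w_hat = dequantize(codes, scale, zero).reshape(-1)
```

With group size 128, each group of 128 weights shares one scale and zero-point, which is what bounds the rounding error per group.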

Only the MoE expert feed-forward weights (gate_proj, up_proj, down_proj) in layers 1–60 are quantized. The following components are kept in original precision:

  • Attention projections (self_attn)
  • Embeddings and the LM head
  • Layer norms
  • The shared expert
  • Layer 0's dense MLP
  • All vision tower and multimodal projector weights
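As a sketch, this selection rule corresponds to a parameter-name filter along these lines. The module names below assume a DeepSeek-V3-style layout (`model.layers.N.mlp.experts.E.*`) and are hypothetical; the exact names in this checkpoint may differ:

```python
import re

# Hypothetical parameter-name pattern for routed-expert FFN weights.
QUANTIZED = re.compile(
    r"model\.layers\.(\d+)\.mlp\.experts\.\d+\.(gate_proj|up_proj|down_proj)\.weight"
)

def is_quantized(name: str) -> bool:
    """True iff the parameter is a routed-expert FFN weight in layers 1-60."""
    m = QUANTIZED.fullmatch(name)
    return bool(m) and 1 <= int(m.group(1)) <= 60

# Routed expert in a quantized layer: selected.
assert is_quantized("model.layers.5.mlp.experts.17.up_proj.weight")
# Layer 0's dense MLP, attention, and the shared expert: kept in original precision.
assert not is_quantized("model.layers.0.mlp.gate_proj.weight")
assert not is_quantized("model.layers.5.self_attn.q_proj.weight")
assert not is_quantized("model.layers.5.mlp.shared_experts.up_proj.weight")
```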

Usage

This model requires vLLM for inference. Because Kimi-K2.5 uses a custom model architecture (kimi_k25), you must pass --trust-remote-code.

While the MoE expert weights are quantized to 2-bit, the attention, embedding, and norm weights remain in bfloat16, so the on-disk size is ~511 GB and the model still requires substantial GPU memory. In our testing, serving required 8× NVIDIA GH200 96 GB GPUs (2 nodes, tensor parallelism 8).

Installation

pip install vllm

Serving with vLLM

vllm serve daslab-testing/Kimi-K2.5-2bit-GSQ \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --distributed-executor-backend ray \
    --tokenizer-mode hf \
    --mm-encoder-tp-mode data \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 4

Flag notes:

  • --tokenizer-mode hf: Required to prevent garbled output on extended serving sessions (vLLM issue #35718).
  • --mm-encoder-tp-mode data: Required for Kimi-K2.5's vision encoder. The ViT dimensions are not evenly divisible by the tensor-parallel size, which causes cuBLAS errors without this flag.
  • --max-model-len 4096: Adjust upward if GPU memory permits; 4096 is what was used during our testing.
  • --distributed-executor-backend ray: Required for multi-node serving.

Offline inference with vLLM

from vllm import LLM, SamplingParams

llm = LLM(
    model="daslab-testing/Kimi-K2.5-2bit-GSQ",
    trust_remote_code=True,
    tensor_parallel_size=8,
    tokenizer_mode="hf",
    mm_encoder_tp_mode="data",
    max_model_len=4096,
    gpu_memory_utilization=0.85,
)

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)

outputs = llm.generate(["Explain the concept of entropy in thermodynamics."], sampling_params)
print(outputs[0].outputs[0].text)

Chat template

Kimi-K2.5 uses its own tokenizer and chat template. Use the tokenizer bundled with this repository:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "daslab-testing/Kimi-K2.5-2bit-GSQ",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

Limitations

  • This is a research quantization, not a production-ready release. Expect some quality degradation relative to the full-precision model, particularly on tasks requiring precise arithmetic or complex multi-step reasoning.
  • Vision/multimodal capabilities have not been evaluated post-quantization (only the language model weights were quantized).
  • The model uses a custom architecture; some inference frameworks other than vLLM may not support it without modification.

License

This model is derived from moonshotai/Kimi-K2.5 and is subject to the same license terms. Please review those terms before use.
