Kimi-K2.5: 2-bit GSQ Quantization

This is a simulated 2-bit quantized version of moonshotai/Kimi-K2.5, produced using GSQ, a learned post-training quantization method. The model weights are stored in compressed-tensors format and are compatible with vLLM for inference.

Note on simulated quantization: The quantization was optimized at 2-bit precision during training, but the resulting weights are serialized into a 4-bit packed integer format (int32, 8 values per element) via compressed-tensors. At inference time, vLLM loads and dequantizes from this 4-bit container. The weight values themselves take only 4 distinct levels (matching true 2-bit), but the on-disk and in-memory representation is 4-bit; this checkpoint offers no memory or storage saving beyond INT4.
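The packing scheme can be illustrated with a small NumPy sketch. This assumes the usual pack-quantized layout of 8 unsigned 4-bit nibbles per int32 (little-nibble-first order is an assumption here, not taken from this checkpoint); the point is that a "2-bit" weight only ever occupies 4 of the 16 available int4 codes:

```python
import numpy as np

def pack_int4(vals: np.ndarray) -> np.ndarray:
    """Pack unsigned 4-bit values (0..15) into int32, 8 values per element."""
    assert vals.size % 8 == 0
    v = vals.reshape(-1, 8).astype(np.uint32)
    packed = np.zeros(v.shape[0], dtype=np.uint32)
    for i in range(8):
        packed |= v[:, i] << (4 * i)   # nibble i of each int32
    return packed.view(np.int32)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    p = packed.view(np.uint32)
    return np.stack([(p >> (4 * i)) & 0xF for i in range(8)], axis=1).reshape(-1)

# A simulated 2-bit weight uses only 4 distinct codes out of the 16 that
# the int4 container can represent (the concrete levels are illustrative).
codes = np.random.default_rng(0).choice([0, 5, 10, 15], size=64).astype(np.uint8)
packed = pack_int4(codes)
restored = unpack_int4(packed)
```

The container is lossless for the codes, which is why the checkpoint behaves like true 2-bit numerically while costing INT4 storage.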

Model Details

Property             Value
Base model           moonshotai/Kimi-K2.5
Architecture         MoE multimodal LLM (DeepSeek-V3-style MoE)
Transformer layers   61
Routed experts       384 (8 active per token)
Hidden size          7168
Context length       262,144 tokens (256K)
Total parameters     ~547B
Quantization         2-bit GSQ (stored as INT4-packed via compressed-tensors)
Quantized layers     Expert FFN weights, layers 1–60
Group size           128
Calibration dataset  open-thoughts/OpenThoughts-114k
Weight format        compressed-tensors, pack-quantized
Disk size            ~511 GB

Results

Benchmark Results (lm-evaluation-harness)

Benchmark      Metric                Baseline (BF16)   GSQ 2-bit   Δ
GSM8K          exact_match (strict)  94.01             92.57       -1.44
ARC-Challenge  acc_norm              70.14             62.97       -7.17
ARC-Easy       acc_norm              88.80             85.10       -3.70
PIQA           acc_norm              86.29             82.37       -3.92
WinoGrande     acc                   80.82             76.95       -3.87

Perplexity (WikiText-2)

Evaluated on a 128-sample held-out split during quantization, measured at intermediate layer checkpoints as quantization progressed:

Checkpoint              WikiText-2 PPL
Dense baseline          1.734
After layer 6           1.734
After layer 12          1.733
After layer 24          1.733
After layer 36          1.735
After layer 48          1.741
After layer 60 (final)  1.749

The final 2-bit quantized model retains perplexity within 0.015 of the dense baseline (< 1% relative degradation).
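Since perplexity is the exponential of the mean token-level negative log-likelihood, the quoted gap corresponds to a very small shift in average NLL. A quick arithmetic check using the table's own numbers:

```python
import math

ppl_dense, ppl_quant = 1.734, 1.749

# Mean negative log-likelihood implied by each perplexity.
nll_dense = math.log(ppl_dense)
nll_quant = math.log(ppl_quant)
nll_gap = nll_quant - nll_dense            # ~0.0086 nats per token

# Relative perplexity degradation, matching the "< 1%" claim.
rel_degradation = (ppl_quant - ppl_dense) / ppl_dense
```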

Quantization Details

This model was quantized using GSQ, a learned post-training quantization method. Quantization was applied independently to each transformer layer using 4,096 calibration samples of sequence length 4,096 from the OpenThoughts dataset, with group size 128.
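GSQ learns its quantization grid, but the group-wise mechanics can be sketched with a plain round-to-nearest asymmetric 2-bit quantizer at group size 128. This is a simplification for illustration only; the actual method optimizes the levels rather than using min/max RTN:

```python
import numpy as np

def quantize_2bit_groupwise(w: np.ndarray, group_size: int = 128):
    """Round-to-nearest asymmetric 2-bit quantization per group of weights.

    Returns integer codes in {0,1,2,3} plus a per-group scale and zero-point.
    """
    g = w.reshape(-1, group_size)
    wmin = g.min(axis=1, keepdims=True)
    wmax = g.max(axis=1, keepdims=True)
    scale = (wmax - wmin) / 3.0              # 2 bits -> 4 levels -> 3 steps
    scale[scale == 0] = 1.0                  # guard constant groups
    codes = np.clip(np.round((g - wmin) / scale), 0, 3).astype(np.uint8)
    return codes, scale, wmin

def dequantize(codes, scale, zero):
    return codes.astype(np.float32) * scale + zero

rng = np.random.default_rng(0)
w = rng.standard_normal(128 * 4).astype(np.float32)
codes, scale, zero = quantize_2bit_groupwise(w)
w_hat = dequantize(codes, scale, zero).reshape(-1)
```

With group size 128, each group of 128 weights shares one scale and zero-point, which is what bounds the rounding error per group.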

Only the MoE expert feed-forward weights (gate_proj, up_proj, down_proj) in layers 1–60 are quantized. The following components are kept in original precision:

  • Attention projections (self_attn)
  • Embeddings and the LM head
  • Layer norms
  • The shared expert
  • Layer 0's dense MLP
  • All vision tower and multimodal projector weights
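As a sketch, this selection rule corresponds to a parameter-name filter along these lines. The module names below assume a DeepSeek-V3-style layout (`model.layers.N.mlp.experts.E.*`) and are hypothetical; the exact names in this checkpoint may differ:

```python
import re

# Hypothetical parameter-name pattern for routed-expert FFN weights.
QUANTIZED = re.compile(
    r"model\.layers\.(\d+)\.mlp\.experts\.\d+\.(gate_proj|up_proj|down_proj)\.weight"
)

def is_quantized(name: str) -> bool:
    """True iff the parameter is a routed-expert FFN weight in layers 1-60."""
    m = QUANTIZED.fullmatch(name)
    return bool(m) and 1 <= int(m.group(1)) <= 60

# Routed expert in a quantized layer: selected.
assert is_quantized("model.layers.5.mlp.experts.17.up_proj.weight")
# Layer 0's dense MLP, attention, and the shared expert: kept in original precision.
assert not is_quantized("model.layers.0.mlp.gate_proj.weight")
assert not is_quantized("model.layers.5.self_attn.q_proj.weight")
assert not is_quantized("model.layers.5.mlp.shared_experts.up_proj.weight")
```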

Usage

This model requires vLLM for inference. Because Kimi-K2.5 uses a custom model architecture (kimi_k25), you must pass --trust-remote-code.

While the MoE expert weights are quantized to 2-bit, the attention, embedding, and norm weights remain in bfloat16, so the on-disk size is ~511 GB and the model still requires substantial GPU memory. In our testing, serving required 8× NVIDIA GH200 96 GB GPUs (2 nodes, tensor parallelism 8).

Installation

pip install vllm

Serving with vLLM

vllm serve daslab-testing/Kimi-K2.5-2bit-GSQ \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --distributed-executor-backend ray \
    --tokenizer-mode hf \
    --mm-encoder-tp-mode data \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 4

Flag notes:

  • --tokenizer-mode hf: Required to prevent garbled output on extended serving sessions (vLLM issue #35718).
  • --mm-encoder-tp-mode data: Required for Kimi-K2.5's vision encoder. The ViT dimensions are not evenly divisible by the tensor-parallel size, which causes cuBLAS errors without this flag.
  • --max-model-len 4096: Adjust upward if GPU memory permits; 4096 is what was used during our testing.
  • --distributed-executor-backend ray: Required for multi-node serving.

Offline inference with vLLM

from vllm import LLM, SamplingParams

llm = LLM(
    model="daslab-testing/Kimi-K2.5-2bit-GSQ",
    trust_remote_code=True,
    tensor_parallel_size=8,
    tokenizer_mode="hf",
    mm_encoder_tp_mode="data",
    max_model_len=4096,
    gpu_memory_utilization=0.85,
)

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)

outputs = llm.generate(["Explain the concept of entropy in thermodynamics."], sampling_params)
print(outputs[0].outputs[0].text)

Chat template

Kimi-K2.5 uses its own tokenizer and chat template. Use the tokenizer bundled with this repository:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "daslab-testing/Kimi-K2.5-2bit-GSQ",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

Limitations

  • This is a research quantization, not a production-ready release. Expect some quality degradation relative to the full-precision model, particularly on tasks requiring precise arithmetic or complex multi-step reasoning.
  • Vision/multimodal capabilities have not been evaluated post-quantization (only the language model weights were quantized).
  • The model uses a custom architecture; some inference frameworks other than vLLM may not support it without modification.

License

This model is derived from moonshotai/Kimi-K2.5 and is subject to the same license terms. Please review those terms before use.
