Guide to run Kimi K2.5 locally on your device.

#19 · opened by shimmyshimmer

Hey guys, we made a guide to run the model locally. You'll need 240GB of RAM or unified memory for best results.

Note that VRAM is not required.
You can run it on a Mac with 256GB of unified memory at similar speeds, or with 256GB of RAM and no VRAM at all.

You can even run it with much less memory (e.g. 80GB RAM) since it'll offload, but it'll be slower.

Guide: https://unsloth.ai/docs/models/kimi-k2.5
GGUFs to run: https://huggingface.co/unsloth/Kimi-K2.5-GGUF
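
If you'd rather poke at a downloaded quant from Python instead of the CLI route in the guide, here's a minimal sketch using llama-cpp-python. The file path, context size, and quant choice below are placeholders for whatever you actually downloaded from unsloth/Kimi-K2.5-GGUF, not the exact names in the repo.

```python
# Minimal sketch, assuming llama-cpp-python is installed and a quant
# (e.g. Q4_K_M) has already been downloaded locally.
from llama_cpp import Llama

llm = Llama(
    # Placeholder path: point this at the first .gguf shard of your quant.
    model_path="./Kimi-K2.5-GGUF/Q4_K_M/first-shard.gguf",
    n_ctx=16384,     # context window; raise it if you have memory to spare
    n_gpu_layers=0,  # 0 = pure CPU/RAM; set >0 to offload some layers to a GPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```

llama.cpp memory-maps the weights by default, which is the mechanism behind the "much less RAM, but slower" note above.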


What's the quality of the output? Does it give the same quality in writing and tool calling for agentic work as the full model?

Hi @youhanasheriff ,

Great question! Here's what you should expect from the GGUF quantized versions:

Quality Expectations

| Quantization | Size | Quality Impact |
|---|---|---|
| Q8_0 | ~530GB | Virtually identical to FP16 (<1% degradation) |
| Q6_K | ~400GB | Excellent quality, minimal loss |
| Q4_K_M | ~280GB | Good quality, slight degradation on complex tasks |
| Q3_K_M | ~210GB | Noticeable quality drop, still usable |
| Q2_K | ~150GB | Significant degradation, for testing only |

For Agentic/Tool Calling

Tool calling and agentic tasks are more sensitive to quantization than general chat, for a few reasons (see the self-check sketch after this list):

  1. Structured JSON output requires precise token prediction
  2. Multi-step reasoning accumulates small errors
  3. Code generation needs exact syntax
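
To make point 1 concrete, here's a rough self-check you could run per quant: ask for strict JSON a few dozen times and count how often the output actually parses into the expected shape. It assumes you're serving the GGUF behind an OpenAI-compatible endpoint (for example llama.cpp's llama-server); the URL, model name, prompt, and trial count are all placeholders.

```python
# Rough JSON-validity spot check against a locally served quant.
# Endpoint URL and model name are placeholders for your local setup.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

PROMPT = (
    "Return ONLY a JSON object with keys 'city' (string) and "
    "'population' (integer) for the largest city in Japan."
)

trials, valid = 20, 0
for _ in range(trials):
    resp = client.chat.completions.create(
        model="kimi-k2.5",  # placeholder: whatever name your server exposes
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0.6,
        max_tokens=128,
    )
    text = (resp.choices[0].message.content or "").strip()
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        continue
    if isinstance(obj, dict) and isinstance(obj.get("city"), str) \
            and isinstance(obj.get("population"), int):
        valid += 1

print(f"valid structured outputs: {valid}/{trials}")
```

A crude pass rate like this is no benchmark, but it's a quick way to see whether a smaller quant holds up on your own tool-calling prompts before committing to it.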

Recommendations:

  • For serious agentic work: Q6_K or Q8_0 (a download snippet follows this list)
  • For casual use/testing: Q4_K_M works reasonably well
  • Avoid Q3 and below for tool calling
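
If you settle on Q6_K or Q8_0, you don't need to pull the whole repo; something like the following fetches only one quant's files. The glob pattern and target directory are assumptions on my side, so check the repo's file layout on Hugging Face first.

```python
# Minimal sketch: download only the Q6_K files from the GGUF repo.
# Assumes huggingface_hub is installed; pattern/local_dir are placeholders.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Kimi-K2.5-GGUF",
    allow_patterns=["*Q6_K*"],     # swap for "*Q8_0*" etc. to match your pick
    local_dir="./Kimi-K2.5-GGUF",  # placeholder download location
)
```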

Reality Check

The full FP16/INT4 model on GPU clusters will always outperform GGUF on CPU/RAM, but for local experimentation and development, the Q6_K/Q8_0 quantizations are remarkably good.

The Unsloth team has done excellent work optimizing these quantizations specifically for Kimi-K2.5.

Hope this helps!
