Decent PPL with 100% IQ4_KSS

#3
by sokann - opened

I tried a quant with 100% IQ4_KSS tensors, and the PPL is quite good:

Final estimate: PPL over 594 chunks for n_ctx=512 = 3.9098 +/- 0.02107

The size is 58.27 GiB, so about 10 GiB smaller πŸ˜„

Nice!

Yeah, I kept the attn tensors all a little larger at iq6_k, and also ffn_down_exp one size larger at iq4_ks, so the perplexity will be slightly better at a cost of size. 58 or 68 GB is still a slightly awkward break point in size, as folks will likely have either 48 or 96 GB VRAM... but with your iq4_kss you can definitely fit more context if needed, and I'm sure it will be slightly faster too!
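As a rough sketch, a recipe like that (attention a step up at iq6_k, ffn_down a step up at iq4_ks, everything else falling through to the base type) might look something like this with ik_llama.cpp's llama-quantize. The exact tensor regexes and flag syntax here are assumptions, so verify them against your build:

```shell
# Hypothetical ik_llama.cpp recipe sketch (tensor names and flags assumed, verify locally):
./build/bin/llama-quantize \
    --imatrix imatrix.dat \
    --custom-q "blk\..*\.attn_.*\.weight=iq6_k,blk\..*\.ffn_down.*\.weight=iq4_ks" \
    model-BF16.gguf model-IQ4_KSS.gguf IQ4_KSS
# Tensors not matched by a --custom-q regex fall through to the base IQ4_KSS type.
```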

Thanks for the report!

I'm not sure what other sizes I'd like to release here, and may not release any more unless there are specific requests. Dense model recipes are harder to smash down to 2-ish bpw while keeping them smart enough haha...

Just saw your comment on r/LocalLLaMA about the various quantization types. Very educational πŸ‘

Incidentally, I previously also made an IQ3_XXS / IQ4_XS mix for mainline:

## Attention [0-87]
## Keep qkv the same to allow --merge-qkv
blk\..*\.attn_q.*\.weight=iq4_xs
blk\..*\.attn_k.*\.weight=iq4_xs
blk\..*\.attn_v.*\.weight=iq4_xs
blk\..*\.attn_output.*\.weight=iq4_xs

## Dense Layers [0-87]
blk\..*\.ffn_down\.weight=iq3_xxs
blk\..*\.ffn_(gate|up)\.weight=iq3_xxs

## Non-Repeating layers
token_embd\.weight=iq4_xs
output\.weight=iq4_xs

And this has a much worse PPL of:

Final estimate: PPL over 594 chunks for n_ctx=512 = 4.4030 +/- 0.02604

However, for my eval, it somehow performs the closest to the devstral-2512 served from https://api.mistral.ai, compared to the other bigger quants that I tried. This is really quite bizarre. Might be just some coincidence. I think previously @AesSedai also got some great GLM-4.5/4.6 quants with IQ3_XXS.
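For a rough sense of scale, the gap between those two reported perplexities works out to about 12.6%:

```python
# Perplexities reported above for the two mixes.
ppl_iq4_kss = 3.9098   # 100% IQ4_KSS tensors
ppl_iq3_mix = 4.4030   # IQ3_XXS / IQ4_XS mainline mix

pct_increase = (ppl_iq3_mix - ppl_iq4_kss) / ppl_iq4_kss * 100
print(f"PPL increase: {pct_increase:.1f}%")  # ~12.6%
```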

Interesting mix, seems reasonable for a mainline quant! I'd only suggest changing the non-repeating layers; the tradition for mainline quants is:

## Non-Repeating layers
token_embd\.weight=q4_K
output\.weight=q6_K

This won't make it much bigger as they are non-repeating, and typically keeping the output "head" at ~6 bpw-ish and the token embedding at 4-6 bpw is fine. Keep in mind the type names are case sensitive.

> However, for my eval, it somehow performs the closest to the devstral-2512 served from https://api.mistral.ai

Huh, it could be that the official version served is a lower quant to help them save on costs, maybe some 4-ish bpw vLLM-type quant? Also, what is your "eval"? Yeah, iq3_xxs is one of the last quants ik did on mainline before the newer stuff on ik_llama.cpp...

Thanks for the suggestion. Yeah, if I remember correctly ik mentioned somewhere that Q4 is good enough for token_embd, but I can't find the reference haha

> Huh, it could be that the official version served is a lower quant to help them save on costs, maybe some 4-ish bpw vLLM-type quant? Also, what is your "eval"?

It is a "deterministic" code editing eval where I got the model to make a targeted change to a function, and compare the result with a set of golden files. So when the result diverges, it is very easy to tell. For Devstral Small 2, I actually made a reddit comment wondering about whether mistral is pulling a Matt Shumer, as I got very bad result locally even with Q8, while the official API was almost perfect. But it turned out to be an inference issue that was fixed by https://github.com/ggml-org/llama.cpp/pull/17945. Now the eval is perfect for both Q8 and Q6_K.

However, after 17945, Devstral 2 started to return repeated broken sentences for me, and this was also reported by https://github.com/ggml-org/llama.cpp/issues/18007

In the gguf, these 2 metadata don't look quite right:

general.architecture 	llama
...
llama.rope.scaling.yarn_log_multiplier 	1

The additional fixes from https://github.com/ggml-org/llama.cpp/pull/18006 may not be sufficient. And then there is still https://github.com/ggml-org/llama.cpp/issues/17980. Many issues πŸ˜…

@sokann

lol yeah, seems like a few things are still being ironed out, which is typical of a new release I suppose. I recall bartowski had a change too at some point to fix up some tool-calling stuff: https://github.com/ggml-org/llama.cpp/compare/master...bartowski1182:llama.cpp:master

Stepping back 20 years and taking a look at everything, I feel like folks are just coming up with "MCP" / "tool-use" stuff that is really just a rough bridge between the plain-text LLM interface and everything else that has been around forever, like HTML, REST APIs, JSON-RPC, etc...

What is your preference for client software for tool/agentic use these days? I tried mistral-vibe a bit, but I swear it was making the LLM write code that was JSON-string-escaped in order to push it into a file-write tool call... I'd have assumed it was trained on non-JSON-encoded Python, for example, and figured it should write the code in plain Python format to give better output, and let the client handle JSON-encoding it before passing it to the MCP server to write the file??

I suppose it is early days though, and cool to see all the experimentation as people try to figure out how to do something useful with these models haha...

I use qwen-code and claude-code from time to time. I prefer qwen-code as the Chat Completions API is more common, e.g. when using inference providers such as Fireworks, and it also keeps the whole conversation from start to end without messing up the KV cache, so it's more friendly to a local GPU-poor setup too haha

I have only tried mistral-vibe for a bit. Yeah, it is probably not a good idea to get the model to write JSON-encoded Python code. Previously, in my eval, I found the old-gen models to be very bad at writing diffs, so I always got them to output the whole rewritten function instead, and then pieced it back in with a bash script.

Recently, opus-4.5 with claude-code did help me to troubleshoot an issue at day job. It is quite something. My 3.2bpw GLM-4.7 quant with qwen-code could also find the root cause, but it sort of stumbled upon the culprit, whereas opus-4.5 did it like a pro. Hopefully GLM-5 can match that 😬

@sokann

Yeah very curious how GLM-5, Kimi's next release, and especially DeepSeek-V4 will perform!

My experimental "dense attention only" https://huggingface.co/ubergarm/DeepSeek-V3.2-Speciale-GGUF seems pretty good for one-shotting programming questions (but no tool use built in without special stuff mentioned over here: https://huggingface.co/sszymczyk/DeepSeek-V3.2-nolight-GGUF/discussions/1#6960ef9120016df3a4ad023b

Happy hacking!

Yeah the V3.2-Speciale model is indeed quite special. I followed your guide to make an even smaller quant that fits into 128GB RAM + 24GB VRAM, with 3 x 17 routed experts tensors quantized to IQ1_M_R4 and 3 x 41 routed experts tensors quantized to IQ1_S_R4. The PPL is 5.0339, quite a bit worse compared to the IQ1_KT quant. It somehow still works nicely, and was able to reason through a complicated deadlock issue after 20k~30k reasoning tokens.

The conventional wisdom is that even if a model can be quantized to less than 4 bits, it shouldn't be done for a reasoning model, as the errors will keep accumulating while generating the reasoning tokens, eventually diverging too much from the original model. V3.2-Speciale kind of invalidates that πŸ˜…
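As a toy back-of-the-envelope for that intuition (purely illustrative, not a real model of quantization noise): if each generated token independently had some small probability p of being knocked off-track, the chance of an n-token reasoning chain staying faithful would shrink geometrically:

```python
# Toy model: per-token divergence probability p, reasoning chain length n.
def stay_on_track(p: float, n: int) -> float:
    return (1.0 - p) ** n

# Even a tiny per-token error rate compounds over a long reasoning trace:
for n in (1_000, 30_000):
    print(n, stay_on_track(1e-4, n))  # ~0.90 at n=1000, ~0.05 at n=30000
```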

@sokann

Oh sweet! If you release that quant, def post something about it! I didn't do any of the smallest "barely runs on 128GB" IQ1_S & IQ1_M, so glad you covered that!

> The conventional wisdom is that even if a model can be quantized to less than 4 bits, it shouldn't be done for a reasoning model

I've heard some folks suggest that reasoning models can take heavier quantization, with the theory being that the reasoning allows them to recover from mistakes, but who knows xD

But yeah, this model's perplexity doesn't increase as fast with heavier quantization as others I've tried, and it's pretty incredible these sub-2 bpw quants are usable!

I was expecting disaster with UD-Q4_K_XL after hearing all the flak about unsloth quants, but wikitext raw gave: 512 CTX - Final estimate: PPL over 594 chunks for n_ctx=512 = 3.8403 +/- 0.02028

@Lockout

Heya, what are you doing here with all the new Qwen models dropping? lmao

  • IQ4_KSS 68.536 GiB (4.709 BPW)
    • PPL over 594 chunks for n_ctx=512 = 3.8832 +/- 0.02076
  • UD-Q4_K_XL over 75GB
    • Your number: 3.8403 +/- 0.02028

There is always some slight variation depending on the hardware backend used e.g. CUDA/CPU-only etc. I use CPU-only methodology explained here: https://huggingface.co/ubergarm/MiniMax-M2.5-GGUF/discussions/3#698f7ebf2aa648f3b77a1262

So I'd have to run it myself probably to get the most apples-apples comparison, or you could run both my quant and that one on your same rig using the same methodology etc.
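For what it's worth, treating the +/- values as standard errors (an assumption about what that estimator reports), the two numbers above are only about 1.5 combined standard errors apart, so some of the gap could just be noise on top of any backend differences:

```python
import math

def z_score(ppl_a, se_a, ppl_b, se_b):
    """How many combined standard errors apart are two PPL estimates?"""
    return abs(ppl_a - ppl_b) / math.sqrt(se_a**2 + se_b**2)

# IQ4_KSS vs UD-Q4_K_XL numbers from above:
z = z_score(3.8832, 0.02076, 3.8403, 0.02028)
print(f"{z:.2f} combined standard errors apart")  # ~1.48
```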

My opinion is that UD quants are generally fine for dense models and will give somewhat similar PPL/KLD as bartowski, mradermacher, etc.

For MoE models I prefer my own recipes and AesSedai's.

Have fun!

Meh, the qwen models are very censored and will be slower because of hybrid inference. I was getting worried that I should upgrade my mistral quant but it turns out not to be the case. We are in the same ballpark with similar size.

I think the only qwen I'm willing to try is the big one, but that's after I play with minimax. My system isn't one where I can just boof all the exps onto the CPU and get more than single-digit tokens, so I have context pressure. Maybe after ik plays with NUMA that situation will change. Getting really addicted to those 30+ t/s of the fully offloaded dense model, and the flavor-of-the-month benchmaxx just doesn't look so appetizing. Doubly so when you add the all-night download-a-thon.
