New uploads add llama.cpp fixes

#11
by danielhanchen - opened

New uploads add 2 new fixes. You will need to redownload.

  1. vocab: fix Gemma4 tokenizer (#21343) - https://github.com/ggml-org/llama.cpp/pull/21343
  2. fix: gemma 4 template (#21326) - https://github.com/ggml-org/llama.cpp/pull/21326

Hi @danielhanchen , thank you for your service.
The model finally started working in RooCode after those updates 🦾

One request - the IQ3_S and IQ3_XXS quants are the same file size. On my setup (12GB VRAM + 16GB RAM), a slightly smaller IQ3_XXS (around 10GB) would be a perfect match. Could you 🙏 please upload a roughly 10GB Q3 variant?


I'm having some issues. I'm not sure if they are related to the .gguf format or the model itself.
Sometimes I get an infinite repetition of the same phrase (like "me... me... me...") or tokens similar to what is described here: [link], but at the end of the response. In my case it shows different tags (I can't remember exactly which ones, <eos> or something similar).
Also, the model sometimes writes its answer in the <thinking> section.
These issues happen from time to time, but most of the time the model works just fine.

Unsloth AI org

Hi @danielhanchen , thank you for your service.
The model finally started working in RooCode after those updates 🦾

One request - the IQ3_S and IQ3_XXS quants are the same file size. On my setup (12GB VRAM + 16GB RAM), a slightly smaller IQ3_XXS (around 10GB) would be a perfect match. Could you 🙏 please upload a roughly 10GB Q3 variant?


We'll see what we can do. For now the Q2 quants should be decent enough. Gemma 4 quantization doesn't differ that much between the bit widths.

I'm having some issues. I'm not sure if they are related to the .gguf format or the model itself.
Sometimes I get an infinite repetition of the same phrase (like "me... me... me...") or tokens similar to what is described here: [link], but at the end of the response. In my case it shows different tags (I can't remember exactly which ones, <eos> or something similar).
Also, the model sometimes writes its answer in the <thinking> section.
These issues happen from time to time, but most of the time the model works just fine.

Could be related to the CUDA version you're using. Where are you running it? In Unsloth Studio you shouldn't experience the issue.

Could be related to the CUDA version you're using. Where are you running it? In Unsloth Studio you shouldn't experience the issue.

CUDA 12.8, latest version of llama.cpp server, Open WebUI. Maybe it's because I'm using complex system instructions? But I don't have similar issues with the 31B model.

Could be related to the CUDA version you're using. Where are you running it? In Unsloth Studio you shouldn't experience the issue.

CUDA 12.8, latest version of llama.cpp server, Open WebUI. Maybe it's because I'm using complex system instructions? But I don't have similar issues with the 31B model.

I'm not entirely sure if this is related, but I do know they started using CUDA 13+ in the latest llama.cpp server-cuda image. I had to update in order to run my llama.cpp containers.
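For anyone wanting to try the same update, a minimal sketch of the container refresh might look like this. The image tag, model path, and GPU flag are assumptions (llama.cpp publishes container images on GHCR, and the GPU line uses Podman's CDI syntax); check the ggml-org/llama.cpp packages page for the exact tags your setup needs.

```shell
# Pull a fresh server-cuda image (tag name assumed - verify on GHCR)
podman pull ghcr.io/ggml-org/llama.cpp:server-cuda

# Recreate the container against the new image.
# --device nvidia.com/gpu=all is Podman's CDI way of passing all GPUs through.
podman run --rm --device nvidia.com/gpu=all \
  -v /path/to/models:/models -p 8080:8080 \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/gemma-4-26B-A4B-it-UD-IQ4_XS.gguf \
  --host 0.0.0.0 --port 8080
```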

I'm having some issues. I'm not sure if they are related to the .gguf format or the model itself.
Sometimes I get an infinite repetition of the same phrase (like "me... me... me...") or tokens similar to what is described here: [link], but at the end of the response. In my case it shows different tags (I can't remember exactly which ones, <eos> or something similar).
Also, the model sometimes writes its answer in the <thinking> section.
These issues happen from time to time, but most of the time the model works just fine.

Getting the same thing. I just spun up a new Strix Halo, so I had to rebuild my podman containers, which meant a fresh recompile of llama.cpp. I copied over the models from my old Strix Halo (downloaded less than 8 hours after the original Unsloth upload) and used the fresh llama containers - and Gemma is losing its mind. Trying a fresh download of these GGUFs to see if it's any better.

https://github.com/ggml-org/llama.cpp/issues/21423

Post-b8660 llama.cpp builds have some issues with the IQ4_XS quant I tested. Does the quant need an update, or does llama.cpp need a fix?

build : b8667-c08d28d08
model : gemma-4-26B-A4B-it-UD-IQ4_XS.gguf

Just knocked a hard problem out of the park for me, pulled the git repo three hours ago. April 05 8:44 CST.

Getting the same thing. I just spun up a new Strix Halo, so I had to rebuild my podman containers, which meant a fresh recompile of llama.cpp. I copied over the models from my old Strix Halo (downloaded less than 8 hours after the original Unsloth upload) and used the fresh llama containers - and Gemma is losing its mind. Trying a fresh download of these GGUFs to see if it's any better.

Did you get anything working? I'm building the latest llama.cpp from GitHub, using the latest gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf, and get weird output like:

cellars-and-and-and-and-and... You'slosh-and-and-and-and and-and-and and-and-and-and and-and-and ... You slip on theness of lissness-and-and-and-and and and-and-and and-and and-and-and and-and and-and and-and... You stumble-and-and-and-and and-and and-and to the corner of support-and

It seems a correct prompt fixes the problem. Just add an empty:

<|channel>thought
<channel|>

prefix to the initial model output.

Full prompt:

<|turn>system
You are an assistant.
<turn|>
<|turn>user
Generate a story.
<turn|>
<|turn>model
<|channel>thought
<channel|>
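If you drive llama-server's raw /completion endpoint (which accepts a plain `prompt` string, so no chat template is applied), the prefix workaround above can be sketched like this. The token strings are copied verbatim from the post, and the helper names are mine; the endpoint URL assumes a local llama-server on port 8080.

```python
import json
import urllib.request

# Empty thought-channel prefix, copied from the workaround above
THOUGHT_PREFIX = "<|channel>thought\n<channel|>\n"


def build_prompt(system: str, user: str) -> str:
    """Assemble the raw prompt so it ends with an empty thought channel,
    making the model start generating after the prefix."""
    return (
        f"<|turn>system\n{system}\n<turn|>\n"
        f"<|turn>user\n{user}\n<turn|>\n"
        f"<|turn>model\n{THOUGHT_PREFIX}"
    )


def complete(prompt: str, url: str = "http://localhost:8080/completion") -> str:
    """POST the raw prompt to llama-server's /completion endpoint
    (assumes a server is already running locally)."""
    body = json.dumps({"prompt": prompt, "n_predict": 256}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]
```

Using the raw endpoint like this bypasses whatever (possibly buggy) chat template is baked into the GGUF, which is presumably why the prefix helps.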

I'm still getting gibberish or < unused49 > spam on Strix Halo with ROCm Version: 7.13.0a20260404 - Llama.cpp Commit Hash: c08d2 - Build Date: 2026-04-05 15:33:29 UTC

UD-IQ4_XS.gguf uploaded 2-3 days ago works perfectly, so I wonder if it makes sense to try the recently updated quant.
I've noticed IQ4_XS quants keep surprising me in terms of generation results, being much smaller than Q4_K_M and Q5_K_M. A hidden gem in real-world usage. Benchmarks for some reason tell a different story.

Unsloth AI org

UD-IQ4_XS.gguf uploaded 2-3 days ago works perfectly, so I wonder if it makes sense to try the recently updated quant.
I've noticed IQ4_XS quants keep surprising me in terms of generation results, being much smaller than Q4_K_M and Q5_K_M. A hidden gem in real-world usage. Benchmarks for some reason tell a different story.

I think that was when we updated it. We will be updating it again later this week once llama.cpp fixes more bugs.

danielhanchen pinned discussion

I'm still getting gibberish or < unused49 > spam on Strix Halo with ROCm Version: 7.13.0a20260404 - Llama.cpp Commit Hash: c08d2 - Build Date: 2026-04-05 15:33:29 UTC

This seems to be related to https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/discussions/2
By trial and error, I found this bug to also be present on UD-Q4_K_XL and MXFP4_MOE with llama.cpp b8664. Unsloth UD-Q5_K_S doesn't have that problem, nor do other quants (bartowski, ggml, lmstudio), and judging by other users, smaller quants seem to work fine too. Looks like an Unsloth 4-bit-specific bug.

I'm having some issues. I'm not sure if they are related to the .gguf format or the model itself.
Sometimes I get an infinite repetition of the same phrase (like "me... me... me...") or tokens similar to what is described here: [link], but at the end of the response. In my case it shows different tags (I can't remember exactly which ones, <eos> or something similar).
Also, the model sometimes writes its answer in the <thinking> section.
These issues happen from time to time, but most of the time the model works just fine.

Try increasing the temperature to resolve it. You can also increase top-k and use 0.01 for min-p.
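In llama-server terms, that advice might look like the following. The specific values are illustrative, not recommendations from this thread; the flag names (`--temp`, `--top-k`, `--min-p`) are llama.cpp's standard sampling options.

```shell
# Raise temperature, widen top-k, and set a small min-p to break
# repetition loops (values illustrative - tune for your setup)
llama-server -m gemma-4-26B-A4B-it-UD-IQ4_XS.gguf \
  --temp 1.0 --top-k 64 --min-p 0.01
```

The same parameters can also be sent per-request in the JSON body of a /completion call instead of being fixed at server start.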

Unsloth AI org

I'm still getting gibberish or < unused49 > spam on Strix Halo with ROCm Version: 7.13.0a20260404 - Llama.cpp Commit Hash: c08d2 - Build Date: 2026-04-05 15:33:29 UTC

This seems to be related to https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/discussions/2
By trial and error, I found this bug to also be present on UD-Q4_K_XL and MXFP4_MOE with llama.cpp b8664. Unsloth UD-Q5_K_S doesn't have that problem, nor do other quants (bartowski, ggml, lmstudio), and judging by other users, smaller quants seem to work fine too. Looks like an Unsloth 4-bit-specific bug.

Apparently it's now solved for you through the update?

danielhanchen unpinned discussion
