New uploads add llama.cpp fixes

#11
by danielhanchen - opened

New uploads add 2 new fixes. You will need to redownload.

  1. vocab: fix Gemma4 tokenizer (#21343) - https://github.com/ggml-org/llama.cpp/pull/21343
  2. fix: gemma 4 template (#21326) - https://github.com/ggml-org/llama.cpp/pull/21326

Hi @danielhanchen , thank you for your service.
The model finally started working in RooCode after those updates 🦾

One request - the IQ3_S and IQ3_XXS quants are the same file size. On my setup (12GB VRAM + 16GB RAM), a slightly smaller IQ3_XXS (around 10GB) would be a perfect match. Could you 🙏 please upload a roughly 10GB Q3 variant?


I'm having some issues. I'm not sure if they are related to the .gguf format or the model itself.
Sometimes I get an infinite repetition of the same phrase (like "me... me... me...") or tokens similar to what is described here: [link], but at the end of the response. In my case it shows different tags (I can't remember exactly which ones, <eos> or something similar).
Also, the model sometimes writes its answer in the <thinking> section.
These issues happen from time to time, but most of the time the model works just fine.

Unsloth AI org

Hi @danielhanchen , thank you for your service.
The model finally started working in RooCode after those updates 🦾

One request - the IQ3_S and IQ3_XXS quants are the same file size. On my setup (12GB VRAM + 16GB RAM), a slightly smaller IQ3_XXS (around 10GB) would be a perfect match. Could you 🙏 please upload a roughly 10GB Q3 variant?


We'll see what we can do. For now the Q2 quants should be decent enough. Gemma 4 quantization doesn't differ that much between the bit widths.

I'm having some issues. I'm not sure if they are related to the .gguf format or the model itself.
Sometimes I get an infinite repetition of the same phrase (like "me... me... me...") or tokens similar to what is described here: [link], but at the end of the response. In my case it shows different tags (I can't remember exactly which ones, <eos> or something similar).
Also, the model sometimes writes its answer in the <thinking> section.
These issues happen from time to time, but most of the time the model works just fine.

Could be related to the CUDA version you're using. Where are you running it? In Unsloth Studio you shouldn't experience the issue.

Could be related to the CUDA version you're using. Where are you running it? In Unsloth Studio you shouldn't experience the issue.

CUDA 12.8, latest version of llama.cpp server, Open WebUI. Maybe it's because I'm using complex system instructions? But I don't have similar issues with the 31B model.

Could be related to the CUDA version you're using. Where are you running it? In Unsloth Studio you shouldn't experience the issue.

CUDA 12.8, latest version of llama.cpp server, Open WebUI. Maybe it's because I'm using complex system instructions? But I don't have similar issues with the 31B model.

I'm not entirely sure if this is related, but I do know they started using CUDA 13+ in the latest llama.cpp server-cuda image. I had to update in order to run my llama.cpp containers.
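For anyone wanting to try the same update, a minimal sketch of the container refresh might look like this. The image tag, model path, and GPU flag are assumptions (llama.cpp publishes container images on GHCR, and the GPU line uses Podman's CDI syntax); check the ggml-org/llama.cpp packages page for the exact tags your setup needs.

```shell
# Pull a fresh server-cuda image (tag name assumed - verify on GHCR)
podman pull ghcr.io/ggml-org/llama.cpp:server-cuda

# Recreate the container against the new image.
# --device nvidia.com/gpu=all is Podman's CDI way of passing all GPUs through.
podman run --rm --device nvidia.com/gpu=all \
  -v /path/to/models:/models -p 8080:8080 \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/gemma-4-26B-A4B-it-UD-IQ4_XS.gguf \
  --host 0.0.0.0 --port 8080
```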

I'm having some issues. I'm not sure if they are related to the .gguf format or the model itself.
Sometimes I get an infinite repetition of the same phrase (like "me... me... me...") or tokens similar to what is described here: [link], but at the end of the response. In my case it shows different tags (I can't remember exactly which ones, <eos> or something similar).
Also, the model sometimes writes its answer in the <thinking> section.
These issues happen from time to time, but most of the time the model works just fine.

Getting the same thing. I just spun up a new Strix Halo, so I had to rebuild my podman containers, which meant a fresh recompile of llama.cpp. I copied over the models from my old Strix Halo (downloaded less than 8 hours after the original Unsloth upload) and used the fresh llama containers - and Gemma is losing its mind. Trying a fresh download of these GGUFs to see if it's any better.

https://github.com/ggml-org/llama.cpp/issues/21423

Post-b8660 llama.cpp builds have some issues with the IQ4_XS quant I tested. Does the quant need an update, or does llama.cpp need a fix?

build : b8667-c08d28d08
model : gemma-4-26B-A4B-it-UD-IQ4_XS.gguf

Just knocked a hard problem out of the park for me, pulled the git repo three hours ago. April 05 8:44 CST.

Getting the same thing. I just spun up a new Strix Halo, so I had to rebuild my podman containers, which meant a fresh recompile of llama.cpp. I copied over the models from my old Strix Halo (downloaded less than 8 hours after the original Unsloth upload) and used the fresh llama containers - and Gemma is losing its mind. Trying a fresh download of these GGUFs to see if it's any better.

Did you get anything working? I'm building the latest llama.cpp from GitHub, using the latest gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf, and get weird output like:

cellars-and-and-and-and-and... You'slosh-and-and-and-and and-and-and and-and-and-and and-and-and ... You slip on theness of lissness-and-and-and-and and and-and-and and-and and-and-and and-and and-and and-and... You stumble-and-and-and-and and-and and-and to the corner of support-and

It seems a correct prompt fixes the problem. Just add an empty:

<|channel>thought
<channel|>

prefix to the initial model output.

Full prompt:

<|turn>system
You are an assistant.
<turn|>
<|turn>user
Generate a story.
<turn|>
<|turn>model
<|channel>thought
<channel|>
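If you drive llama-server's raw /completion endpoint (which accepts a plain `prompt` string, so no chat template is applied), the prefix workaround above can be sketched like this. The token strings are copied verbatim from the post, and the helper names are mine; the endpoint URL assumes a local llama-server on port 8080.

```python
import json
import urllib.request

# Empty thought-channel prefix, copied from the workaround above
THOUGHT_PREFIX = "<|channel>thought\n<channel|>\n"


def build_prompt(system: str, user: str) -> str:
    """Assemble the raw prompt so it ends with an empty thought channel,
    making the model start generating after the prefix."""
    return (
        f"<|turn>system\n{system}\n<turn|>\n"
        f"<|turn>user\n{user}\n<turn|>\n"
        f"<|turn>model\n{THOUGHT_PREFIX}"
    )


def complete(prompt: str, url: str = "http://localhost:8080/completion") -> str:
    """POST the raw prompt to llama-server's /completion endpoint
    (assumes a server is already running locally)."""
    body = json.dumps({"prompt": prompt, "n_predict": 256}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]
```

Using the raw endpoint like this bypasses whatever (possibly buggy) chat template is baked into the GGUF, which is presumably why the prefix helps.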

I'm still getting gibberish or < unused49 > spam on Strix Halo with ROCm Version: 7.13.0a20260404 - Llama.cpp Commit Hash: c08d2 - Build Date: 2026-04-05 15:33:29 UTC

UD-IQ4_XS.gguf uploaded 2-3 days ago works perfectly, so I wonder if it makes sense to try the recently updated quant.
I've noticed IQ4_XS quants keep surprising me in terms of generation results, being much smaller than Q4_K_M and Q5_K_M. A hidden gem in real-world usage. Benchmarks for some reason tell a different story.

Unsloth AI org

UD-IQ4_XS.gguf uploaded 2-3 days ago works perfectly, so I wonder if it makes sense to try the recently updated quant.
I've noticed IQ4_XS quants keep surprising me in terms of generation results, being much smaller than Q4_K_M and Q5_K_M. A hidden gem in real-world usage. Benchmarks for some reason tell a different story.

I think that was when we updated it. We will be updating it again later this week once llama.cpp fixes more bugs.

danielhanchen pinned discussion

I'm still getting gibberish or < unused49 > spam on Strix Halo with ROCm Version: 7.13.0a20260404 - Llama.cpp Commit Hash: c08d2 - Build Date: 2026-04-05 15:33:29 UTC

This seems to be related to https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/discussions/2
By trial and error, I found this bug to also be present on UD-Q4_K_XL and MXFP4_MOE with llama.cpp b8664. Unsloth UD-Q5_K_S doesn't have that problem, nor do other quants (bartowski, ggml, lmstudio), and judging by other users, smaller quants seem to work fine too. Looks like an Unsloth 4-bit-specific bug.

I'm having some issues. I'm not sure if they are related to the .gguf format or the model itself.
Sometimes I get an infinite repetition of the same phrase (like "me... me... me...") or tokens similar to what is described here: [link], but at the end of the response. In my case it shows different tags (I can't remember exactly which ones, <eos> or something similar).
Also, the model sometimes writes its answer in the <thinking> section.
These issues happen from time to time, but most of the time the model works just fine.

Try increasing the temperature to resolve it. You can also increase top-k and use 0.01 for min-p.
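In llama-server terms, that advice might look like the following. The specific values are illustrative, not recommendations from this thread; the flag names (`--temp`, `--top-k`, `--min-p`) are llama.cpp's standard sampling options.

```shell
# Raise temperature, widen top-k, and set a small min-p to break
# repetition loops (values illustrative - tune for your setup)
llama-server -m gemma-4-26B-A4B-it-UD-IQ4_XS.gguf \
  --temp 1.0 --top-k 64 --min-p 0.01
```

The same parameters can also be sent per-request in the JSON body of a /completion call instead of being fixed at server start.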

Unsloth AI org

I'm still getting gibberish or < unused49 > spam on Strix Halo with ROCm Version: 7.13.0a20260404 - Llama.cpp Commit Hash: c08d2 - Build Date: 2026-04-05 15:33:29 UTC

This seems to be related to https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/discussions/2
By trial and error, I found this bug to also be present on UD-Q4_K_XL and MXFP4_MOE with llama.cpp b8664. Unsloth UD-Q5_K_S doesn't have that problem, nor do other quants (bartowski, ggml, lmstudio), and judging by other users, smaller quants seem to work fine too. Looks like an Unsloth 4-bit-specific bug.

Apparently it's now solved for you through the update?

danielhanchen unpinned discussion
