# Dream-v0-Instruct-7B-GGUF

GGUF quantizations of Dream-org/Dream-v0-Instruct-7B for use with diffuse-cpp, the first C++ inference engine for Diffusion Language Models.

Dream is a masked diffusion language model based on the Qwen2.5-7B backbone with Grouped Query Attention (GQA). It generates all tokens in parallel through iterative refinement, excelling at math and factual tasks.

Dream correctly solves 15 x 23 = 345 in just 2 denoising steps at 21.6 tok/s, 2.5x faster than llama.cpp.
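To make the parallel refinement idea concrete, here is a toy sketch of confidence-based masked-diffusion decoding. This is illustrative only, not diffuse-cpp's actual implementation: the `predict` callback, the commit-half schedule, and the string `[MASK]` placeholder are all assumptions for the example.

```python
# Toy sketch of masked-diffusion decoding (illustrative; NOT diffuse-cpp's
# actual algorithm). Every masked position is predicted in parallel each
# step; the most confident predictions are committed, the rest stay masked.

MASK = "[MASK]"

def denoise_step(seq, predict):
    """predict(seq) -> {pos: (token, confidence)} for masked positions."""
    preds = predict(seq)
    if not preds:
        return seq, True
    # Commit the highest-confidence half of the predictions (one common
    # remasking schedule; the exact rule varies between models).
    k = max(1, len(preds) // 2)
    best = sorted(preds.items(), key=lambda kv: -kv[1][1])[:k]
    out = list(seq)
    for pos, (tok, _conf) in best:
        out[pos] = tok
    return out, all(t != MASK for t in out)

def generate(length, predict, max_steps=16):
    """Start fully masked; iteratively unmask until clean or out of steps."""
    seq = [MASK] * length
    for step in range(1, max_steps + 1):
        seq, done = denoise_step(seq, predict)
        if done:
            return seq, step
    return seq, max_steps
```

An easy prompt whose predictions are all high-confidence converges in a couple of steps, while harder prompts keep more positions masked and need more iterations, which mirrors the step counts in the benchmark table below.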

## Available Quantizations

| File | Type | Size | Description |
|---|---|---|---|
| dream-7b-f16.gguf | F16 | ~15 GB | Full precision, best quality |
| dream-7b-q8_0.gguf | Q8_0 | ~8.2 GB | 8-bit quantization, near-lossless |
| dream-7b-q4km.gguf | Q4_K_M | ~5.0 GB | 4-bit mixed, best speed/quality ratio |

**Recommended:** Q4_K_M for most users.
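As a back-of-envelope sanity check, the file sizes above imply roughly the expected bits per weight for each format. The sketch below is approximate: it ignores GGUF metadata and the F32 bias tensors, and treats GB as 10^9 bytes.

```python
# Rough bits-per-weight implied by the file sizes above (approximate:
# ignores GGUF metadata and the F32 QKV bias tensors).
PARAMS = 7.62e9  # parameter count from the model details section

def bits_per_weight(file_gb):
    return file_gb * 1e9 * 8 / PARAMS

for name, gb in [("F16", 15.0), ("Q8_0", 8.2), ("Q4_K_M", 5.0)]:
    print(f"{name}: ~{bits_per_weight(gb):.1f} bits/weight")
```

This lands near 16, 8.6, and 5.2 bits/weight, consistent with F16, Q8_0's 8.5-bit block format, and a 4-bit mixed scheme that keeps some tensors at higher precision.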

## Quick Start

```bash
# Download the quantized model into the current directory
huggingface-cli download diffuse-cpp/Dream-v0-Instruct-7B-GGUF dream-7b-q4km.gguf --local-dir .

# Build diffuse-cpp (v0.2.0+)
git clone --recursive https://github.com/iafiscal1212/diffuse-cpp.git
cd diffuse-cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Run
./build/diffuse-cli -m ../dream-7b-q4km.gguf \
    --tokens "151644,8948,198,2610,525,264,10950,17847,13,151645,198,151644,872,198,3838,374,220,868,1303,220,1419,30,151645,198,151644,77091,198" \
    -n 64 -s 16 -t 12 --remasking entropy_exit
```
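The `--tokens` flag takes pre-tokenized IDs rather than raw text. The sketch below shows the ChatML-style prompt string those IDs encode, assuming Dream keeps Qwen2.5's chat template (an assumption; check the tokenizer config shipped with Dream-org/Dream-v0-Instruct-7B).

```python
# Sketch of the ChatML-style prompt behind the --tokens ID list above,
# assuming Dream inherits Qwen2.5's chat template (an assumption).

def build_chatml(system, user):
    """Assemble a single-turn ChatML prompt ending at the assistant turn."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = build_chatml("You are a helpful assistant.", "What is 15 x 23?")
# Tokenize this string (e.g. with the model's Hugging Face tokenizer) to
# obtain the comma-separated ID list passed via --tokens.
```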

## Performance

Benchmarked on an AMD EPYC 4465P (12 cores), Q4_K_M, entropy_exit + inter-step cache, B=64:

| Prompt | tok/s | Steps | vs llama.cpp |
|---|---|---|---|
| Capital of France? | 21.6 | 2 | 2.5x |
| 15 x 23? | 21.6 | 2 | 2.5x |
| Translate to French | 14.3 | 6 | 1.7x |
| Translate to Spanish | 13.2 | 10 | 1.6x |
| Python is_prime() | 8.2 | 7 | 1.0x |
| Why sky blue? | 4.9 | 16 | 0.6x |
| List planets | 4.9 | 16 | 0.6x |
| Poem about ocean | 4.5 | 16 | 0.5x |
| **Average** | **11.6** | | **1.4x** |
- Dream excels at math and code (converges in 2-7 steps)
- 5 of 8 prompts match or beat llama.cpp (8.51 tok/s baseline)
- llama.cpp baseline: Qwen2.5-7B-Instruct, Q4_K_M, same hardware
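The `entropy_exit` remasking mode presumably stops the denoising loop once the model's predictions are near-deterministic. Here is a toy sketch of such an entropy-based exit test; the exact criterion and threshold used by diffuse-cpp may differ, and `threshold=0.1` nats is an assumption for illustration.

```python
import math

# Toy sketch of an entropy-based early-exit check (illustrative; the
# actual entropy_exit criterion in diffuse-cpp may differ).

def entropy(probs):
    """Shannon entropy in nats of one token's probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_exit(per_token_probs, threshold=0.1):
    """Exit the denoising loop once every position is near-deterministic
    (mean per-token entropy below the threshold)."""
    ents = [entropy(p) for p in per_token_probs]
    return sum(ents) / len(ents) < threshold
```

Under a rule like this, short factual answers (sharp, low-entropy distributions) exit after a couple of steps, while open-ended prompts such as poetry keep high-entropy positions and run the full step budget, matching the spread in the table above.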

## Dream vs LLaDA

| Strength | Dream-7B | LLaDA-8B |
|---|---|---|
| Math/Arithmetic | 21.6 tok/s (2 steps) | 6.0 tok/s (16 steps) |
| Code generation | 8.2 tok/s (7 steps) | 4.5 tok/s (15 steps) |
| Translation | 13-14 tok/s | 23-28 tok/s |
| Creative writing | 4.5 tok/s | 5.0 tok/s |

Use Dream for math, code, and factual tasks; use LLaDA for translation and conversation.

## Model Details

- Architecture: Qwen2.5-7B backbone with bidirectional attention
- Parameters: 7.62B
- Layers: 28
- Hidden size: 3584
- Attention: GQA (28 query / 4 KV heads)
- FFN: SwiGLU, intermediate size 18944
- Vocabulary: 152,064 tokens
- RoPE theta: 1,000,000
- Mask token ID: 151666
- QKV biases: yes (kept at F32 in all quantizations)
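With 28 query heads over 4 KV heads, each KV head is shared by a group of 7 query heads. A minimal sketch of that head-to-group mapping (the contiguous-group layout is the standard GQA convention, assumed here):

```python
# Sketch of the GQA head mapping implied by 28 query / 4 KV heads:
# each KV head serves a contiguous group of 28 / 4 = 7 query heads
# (contiguous grouping is the standard GQA convention, assumed here).

N_Q_HEADS, N_KV_HEADS = 28, 4
GROUP = N_Q_HEADS // N_KV_HEADS  # 7 query heads per KV head

def kv_head_for(q_head):
    """Index of the KV head that query head q_head attends with."""
    return q_head // GROUP
```

Sharing K and V this way shrinks the KV tensors (and their projection weights) by 7x relative to full multi-head attention, which is why the k/v projections are much smaller than the q projection in the GGUF file.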

## Conversion Details

339 tensors (255 weights + 84 QKV biases), converted with convert-dream.py from diffuse-cpp.
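The tensor count can be sanity-checked against the architecture above. The per-layer breakdown below assumes the standard Qwen2.5-style GGUF layout (9 weight tensors per layer plus embedding, final norm, and output head); it is a plausibility check, not taken from convert-dream.py.

```python
# Sanity-check of the 339-tensor count, assuming the standard
# Qwen2.5-style GGUF layout (the per-layer breakdown is an assumption,
# not read from convert-dream.py).

N_LAYERS = 28
# Per layer: attn q/k/v/output, ffn gate/up/down, and two norms.
WEIGHTS_PER_LAYER = 9
# Global: token embedding, final norm, output head.
GLOBAL_WEIGHTS = 3
# Per layer: Q, K, and V biases (kept at F32).
BIASES_PER_LAYER = 3

weights = N_LAYERS * WEIGHTS_PER_LAYER + GLOBAL_WEIGHTS  # 28*9 + 3 = 255
biases = N_LAYERS * BIASES_PER_LAYER                     # 28*3 = 84
print(weights, biases, weights + biases)                 # 255 84 339
```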

## Citation

```bibtex
@software{diffuse_cpp_2026,
  title={diffuse-cpp: High-Performance Inference for Diffusion Language Models},
  author={Carmen Esteban},
  year={2026},
  url={https://github.com/iafiscal1212/diffuse-cpp}
}
```

## License

Apache 2.0
