# Dream-v0-Instruct-7B-GGUF

GGUF quantizations of Dream-org/Dream-v0-Instruct-7B for use with diffuse-cpp, the first C++ inference engine for Diffusion Language Models.

Dream is a masked diffusion language model based on the Qwen2.5-7B backbone with Grouped Query Attention (GQA). It generates all tokens in parallel through iterative refinement, excelling at math and factual tasks.

Dream correctly solves 15 x 23 = 345 in just 2 denoising steps at 21.6 tok/s, 2.5x faster than llama.cpp.
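To make the parallel refinement idea concrete, here is a toy sketch of confidence-based masked-diffusion decoding. This is illustrative only, not diffuse-cpp's actual implementation: the `predict` callback, the commit-half schedule, and the string `[MASK]` placeholder are all assumptions for the example.

```python
# Toy sketch of masked-diffusion decoding (illustrative; NOT diffuse-cpp's
# actual algorithm). Every masked position is predicted in parallel each
# step; the most confident predictions are committed, the rest stay masked.

MASK = "[MASK]"

def denoise_step(seq, predict):
    """predict(seq) -> {pos: (token, confidence)} for masked positions."""
    preds = predict(seq)
    if not preds:
        return seq, True
    # Commit the highest-confidence half of the predictions (one common
    # remasking schedule; the exact rule varies between models).
    k = max(1, len(preds) // 2)
    best = sorted(preds.items(), key=lambda kv: -kv[1][1])[:k]
    out = list(seq)
    for pos, (tok, _conf) in best:
        out[pos] = tok
    return out, all(t != MASK for t in out)

def generate(length, predict, max_steps=16):
    """Start fully masked; iteratively unmask until clean or out of steps."""
    seq = [MASK] * length
    for step in range(1, max_steps + 1):
        seq, done = denoise_step(seq, predict)
        if done:
            return seq, step
    return seq, max_steps
```

An easy prompt whose predictions are all high-confidence converges in a couple of steps, while harder prompts keep more positions masked and need more iterations, which mirrors the step counts in the benchmark table below.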

## Available Quantizations

| File | Type | Size | Description |
|---|---|---|---|
| dream-7b-f16.gguf | F16 | ~15 GB | Full precision, best quality |
| dream-7b-q8_0.gguf | Q8_0 | ~8.2 GB | 8-bit quantization, near-lossless |
| dream-7b-q4km.gguf | Q4_K_M | ~5.0 GB | 4-bit mixed, best speed/quality ratio |

**Recommended:** Q4_K_M for most users.
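As a back-of-envelope sanity check, the file sizes above imply roughly the expected bits per weight for each format. The sketch below is approximate: it ignores GGUF metadata and the F32 bias tensors, and treats GB as 10^9 bytes.

```python
# Rough bits-per-weight implied by the file sizes above (approximate:
# ignores GGUF metadata and the F32 QKV bias tensors).
PARAMS = 7.62e9  # parameter count from the model details section

def bits_per_weight(file_gb):
    return file_gb * 1e9 * 8 / PARAMS

for name, gb in [("F16", 15.0), ("Q8_0", 8.2), ("Q4_K_M", 5.0)]:
    print(f"{name}: ~{bits_per_weight(gb):.1f} bits/weight")
```

This lands near 16, 8.6, and 5.2 bits/weight, consistent with F16, Q8_0's 8.5-bit block format, and a 4-bit mixed scheme that keeps some tensors at higher precision.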

## Quick Start

```bash
# Download the quantized model into the current directory
huggingface-cli download diffuse-cpp/Dream-v0-Instruct-7B-GGUF dream-7b-q4km.gguf --local-dir .

# Build diffuse-cpp (v0.2.0+)
git clone --recursive https://github.com/iafiscal1212/diffuse-cpp.git
cd diffuse-cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Run
./build/diffuse-cli -m ../dream-7b-q4km.gguf \
    --tokens "151644,8948,198,2610,525,264,10950,17847,13,151645,198,151644,872,198,3838,374,220,868,1303,220,1419,30,151645,198,151644,77091,198" \
    -n 64 -s 16 -t 12 --remasking entropy_exit
```
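The `--tokens` flag takes pre-tokenized IDs rather than raw text. The sketch below shows the ChatML-style prompt string those IDs encode, assuming Dream keeps Qwen2.5's chat template (an assumption; check the tokenizer config shipped with Dream-org/Dream-v0-Instruct-7B).

```python
# Sketch of the ChatML-style prompt behind the --tokens ID list above,
# assuming Dream inherits Qwen2.5's chat template (an assumption).

def build_chatml(system, user):
    """Assemble a single-turn ChatML prompt ending at the assistant turn."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = build_chatml("You are a helpful assistant.", "What is 15 x 23?")
# Tokenize this string (e.g. with the model's Hugging Face tokenizer) to
# obtain the comma-separated ID list passed via --tokens.
```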

## Performance

Benchmarked on an AMD EPYC 4465P (12 cores), Q4_K_M, entropy_exit + inter-step cache, B=64:

| Prompt | tok/s | Steps | vs llama.cpp |
|---|---|---|---|
| Capital of France? | 21.6 | 2 | 2.5x |
| 15 x 23? | 21.6 | 2 | 2.5x |
| Translate to French | 14.3 | 6 | 1.7x |
| Translate to Spanish | 13.2 | 10 | 1.6x |
| Python is_prime() | 8.2 | 7 | 1.0x |
| Why sky blue? | 4.9 | 16 | 0.6x |
| List planets | 4.9 | 16 | 0.6x |
| Poem about ocean | 4.5 | 16 | 0.5x |
| **Average** | **11.6** | | **1.4x** |
- Dream excels at math and code (converges in 2-7 steps)
- 5 of 8 prompts match or beat llama.cpp (8.51 tok/s baseline)
- llama.cpp baseline: Qwen2.5-7B-Instruct, Q4_K_M, same hardware
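The `entropy_exit` remasking mode presumably stops the denoising loop once the model's predictions are near-deterministic. Here is a toy sketch of such an entropy-based exit test; the exact criterion and threshold used by diffuse-cpp may differ, and `threshold=0.1` nats is an assumption for illustration.

```python
import math

# Toy sketch of an entropy-based early-exit check (illustrative; the
# actual entropy_exit criterion in diffuse-cpp may differ).

def entropy(probs):
    """Shannon entropy in nats of one token's probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_exit(per_token_probs, threshold=0.1):
    """Exit the denoising loop once every position is near-deterministic
    (mean per-token entropy below the threshold)."""
    ents = [entropy(p) for p in per_token_probs]
    return sum(ents) / len(ents) < threshold
```

Under a rule like this, short factual answers (sharp, low-entropy distributions) exit after a couple of steps, while open-ended prompts such as poetry keep high-entropy positions and run the full step budget, matching the spread in the table above.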

## Dream vs LLaDA

| Strength | Dream-7B | LLaDA-8B |
|---|---|---|
| Math/Arithmetic | 21.6 tok/s (2 steps) | 6.0 tok/s (16 steps) |
| Code generation | 8.2 tok/s (7 steps) | 4.5 tok/s (15 steps) |
| Translation | 13-14 tok/s | 23-28 tok/s |
| Creative writing | 4.5 tok/s | 5.0 tok/s |

Use Dream for math, code, and factual tasks; use LLaDA for translation and conversation.

## Model Details

- Architecture: Qwen2.5-7B backbone with bidirectional attention
- Parameters: 7.62B
- Layers: 28
- Hidden size: 3584
- Attention: GQA (28 query / 4 KV heads)
- FFN: SwiGLU, intermediate size 18944
- Vocabulary: 152,064 tokens
- RoPE theta: 1,000,000
- Mask token ID: 151666
- QKV biases: yes (kept at F32 in all quantizations)
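With 28 query heads over 4 KV heads, each KV head is shared by a group of 7 query heads. A minimal sketch of that head-to-group mapping (the contiguous-group layout is the standard GQA convention, assumed here):

```python
# Sketch of the GQA head mapping implied by 28 query / 4 KV heads:
# each KV head serves a contiguous group of 28 / 4 = 7 query heads
# (contiguous grouping is the standard GQA convention, assumed here).

N_Q_HEADS, N_KV_HEADS = 28, 4
GROUP = N_Q_HEADS // N_KV_HEADS  # 7 query heads per KV head

def kv_head_for(q_head):
    """Index of the KV head that query head q_head attends with."""
    return q_head // GROUP
```

Sharing K and V this way shrinks the KV tensors (and their projection weights) by 7x relative to full multi-head attention, which is why the k/v projections are much smaller than the q projection in the GGUF file.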

## Conversion Details

339 tensors (255 weights + 84 QKV biases), converted with convert-dream.py from diffuse-cpp.
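The tensor count can be sanity-checked against the architecture above. The per-layer breakdown below assumes the standard Qwen2.5-style GGUF layout (9 weight tensors per layer plus embedding, final norm, and output head); it is a plausibility check, not taken from convert-dream.py.

```python
# Sanity-check of the 339-tensor count, assuming the standard
# Qwen2.5-style GGUF layout (the per-layer breakdown is an assumption,
# not read from convert-dream.py).

N_LAYERS = 28
# Per layer: attn q/k/v/output, ffn gate/up/down, and two norms.
WEIGHTS_PER_LAYER = 9
# Global: token embedding, final norm, output head.
GLOBAL_WEIGHTS = 3
# Per layer: Q, K, and V biases (kept at F32).
BIASES_PER_LAYER = 3

weights = N_LAYERS * WEIGHTS_PER_LAYER + GLOBAL_WEIGHTS  # 28*9 + 3 = 255
biases = N_LAYERS * BIASES_PER_LAYER                     # 28*3 = 84
print(weights, biases, weights + biases)                 # 255 84 339
```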

## Citation

```bibtex
@software{diffuse_cpp_2026,
  title={diffuse-cpp: High-Performance Inference for Diffusion Language Models},
  author={Carmen Esteban},
  year={2026},
  url={https://github.com/iafiscal1212/diffuse-cpp}
}
```

## License

Apache 2.0
