DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
F16 GGUF conversion of deepseek-ai/deepseek-moe-16b-base with Rust bindings for llama.cpp's MoE CPU offloading functionality.
This model supports MoE CPU offloading via llama.cpp (implemented in PR #15077). Shimmy provides Rust bindings for this functionality, keeping the expert tensors in system RAM while the rest of the model runs on GPU:
| Configuration | VRAM | TPS (tokens/s) | TTFT |
|---|---|---|---|
| GPU-only | 30.1 GB | 26.8 | 426 ms |
| CPU offload | 2.3 GB | 6.5 | 1,643 ms |
Trade-off: generation speed is exchanged for VRAM savings. Best for VRAM-constrained scenarios where generation speed is less critical than fitting the model at all.
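The scale of the trade-off can be read directly off the benchmark table above; a quick back-of-envelope check:

```python
# Benchmark numbers from the table above (GPU-only vs CPU offload).
gpu = {"vram_gb": 30.1, "tps": 26.8, "ttft_ms": 426}
offload = {"vram_gb": 2.3, "tps": 6.5, "ttft_ms": 1643}

vram_reduction = gpu["vram_gb"] / offload["vram_gb"]    # how much less VRAM
tps_slowdown = gpu["tps"] / offload["tps"]              # how much slower generation
ttft_increase = offload["ttft_ms"] / gpu["ttft_ms"]     # how much slower first token

print(f"VRAM reduction:      {vram_reduction:.1f}x")
print(f"Throughput slowdown: {tps_slowdown:.1f}x")
print(f"TTFT increase:       {ttft_increase:.1f}x")
```

Roughly a 13x VRAM reduction for about a 4x slowdown in both throughput and time to first token.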
DeepSeek MoE uses a dual-expert architecture (64 routed + 2 shared experts), validated to work correctly with CPU offloading:

- Routed expert tensors: `ffn_gate_exps.weight`, `ffn_down_exps.weight`, `ffn_up_exps.weight`
- Shared expert tensors: `ffn_gate_shexp.weight`, `ffn_down_shexp.weight`, `ffn_up_shexp.weight`

Download the model:

```shell
huggingface-cli download MikeKuykendall/deepseek-moe-16b-cpu-offload-gguf \
  --include "deepseek-moe-16b-f16.gguf" \
  --local-dir ./models
```
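The two tensor families above are what an offloading pass has to tell apart. As an illustrative sketch only (not Shimmy's or llama.cpp's actual code), classifying FFN tensors by their name suffix might look like:

```python
# Illustrative sketch: classify DeepSeek MoE FFN tensors by name suffix.
# Routed experts are candidates for CPU offload; the two shared experts
# are active on every token, so keeping them on GPU is the natural choice.
def classify_ffn_tensor(name: str) -> str:
    if name.endswith("_exps.weight"):
        return "routed-expert"
    if name.endswith("_shexp.weight"):
        return "shared-expert"
    return "other"

tensors = [
    "ffn_gate_exps.weight", "ffn_down_exps.weight", "ffn_up_exps.weight",
    "ffn_gate_shexp.weight", "ffn_down_shexp.weight", "ffn_up_shexp.weight",
]
for t in tensors:
    print(t, "->", classify_ffn_tensor(t))
```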
Run with llama.cpp's `llama-server`:

```shell
# Standard loading (requires ~32GB VRAM)
./llama-server -m deepseek-moe-16b-f16.gguf -c 4096

# With MoE CPU offloading (requires ~3GB VRAM + 32GB RAM)
./llama-server -m deepseek-moe-16b-f16.gguf -c 4096 --cpu-moe
```
Or use Shimmy:

```shell
# Install Shimmy
cargo install --git https://github.com/Michael-A-Kuykendall/shimmy --features llama-cuda

# Standard loading
shimmy serve --model deepseek-moe-16b-f16.gguf

# With MoE CPU offloading
shimmy serve --model deepseek-moe-16b-f16.gguf --cpu-moe

# Query the API
curl http://localhost:11435/api/generate \
  -d '{
    "model": "deepseek-moe-16b",
    "prompt": "Explain the architecture of DeepSeek MoE",
    "max_tokens": 256,
    "stream": false
  }'
```
Both configurations (standard GPU loading and CPU offloading) were validated for correct output. Full validation report with controlled baselines: Shimmy MoE CPU Offloading Technical Report.
```bibtex
@article{dai2024deepseekmoe,
  title={DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models},
  author={Dai, Damai and others},
  journal={arXiv preprint arXiv:2401.06066},
  year={2024}
}
```
GGUF conversion and MoE offloading validation by MikeKuykendall