How to use from the
Use from the
llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="dcostenco/prism-coder-14b",
	filename="",
)
llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

prism-coder:14b โ€” Tool Routing Model (Desktop Primary Tier)

Fine-tuned Qwen3-14B for 6-tool routing in the Prism AAC system. First tier in the desktop cascade: 14B โ†’ 32B โ†’ cloud Claude.

BFCL Routing Benchmark โ€” v33 (Current)

Mean: 97.1% (3-seed average, seeds 2027/2028/2029, 102 cases each)

Category Description Accuracy
aac AAC phrase requests โ†’ plain text 100%
cmpct Ledger compaction 100%
edge Multi-step / compound requests 100%
hand Agent handoff / relay 88%
info General facts โ†’ plain text 100%
irrel Irrelevant / live queries โ†’ plain text 100%
know Knowledge base search 100%
load Session context loading 100%
pred Factual / knowledge queries โ†’ plain text 100%
save Session ledger save 92%
smem Session memory search 92%
tran Translation requests โ†’ plain text 100%

Eval: Ollama inference, temperature=0, Qwen3 thinking suppressed (<think>\n\n</think>), num_predict=160.
Gate: โ‰ฅ90% = deploy.

Version History

Version BFCL Notes
v33 97.1% Routing corpus v33, improved hand/save/smem
v32 97.1% Routing corpus v32
v31 ~96% Routing corpus v31
v30 ~95% Baseline 14B routing

Tools

The model routes between exactly 6 tools:

  1. session_load_context โ€” load/fetch/resume project context
  2. session_save_ledger โ€” note/log/remember/record progress
  3. session_save_handoff โ€” handoff/relay to next agent/session
  4. session_compact_ledger โ€” compact/archive/shrink ledger
  5. session_search_memory โ€” recall past sessions/conversations
  6. knowledge_search โ€” search stored notes/knowledge base

Files

File Size Use
prism-aac-14b-q4km.gguf 9.3 GB Recommended for Ollama

Cascade Role

Primary desktop tier. Handles ~97% of routing decisions locally.
Escalates to 32B for edge cases and multi-step compound requests.

Usage (Ollama)

ollama run dcostenco/prism-coder:14b

Training

  • Base: Qwen/Qwen3-14B (fp16, 14.8B params)
  • Framework: MLX-LM LoRA (rank=8, scale=20, 4 layers)
  • Merge: Direct safetensors manipulation (delta = scale/rank ร— B^T A^T)
  • Hardware: Apple Silicon (M-series, 64 GB RAM)
Downloads last month
1,098
GGUF
Model size
15B params
Architecture
qwen3
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for dcostenco/prism-coder-14b

Finetuned
Qwen/Qwen3-14B
Quantized
(178)
this model
Quantizations
2 models