# BlitzKode

BlitzKode is a local AI coding assistant powered by a fine-tuned Qwen2.5-1.5B-Instruct model. It runs entirely on your machine: no external API calls, no data leaving your device.

## Tech Stack
| Layer | Tech |
|---|---|
| Base model | Qwen2.5-1.5B-Instruct |
| Fine-tuning | LoRA (r=16, α=32) via PEFT |
| Training | HuggingFace Transformers + TRL |
| Inference | llama-cpp-python (GGUF Q8_0) |
| Backend | Python 3.11+, FastAPI, uvicorn |
| Frontend | React 18, Vite, Phosphor Icons |
## Features

- Local-first: inference with the bundled GGUF, no cloud dependency
- Real-time streaming: SSE token-by-token via `/generate/stream` (see the client sketch after this list)
- Web research mode: DuckDuckGo search → context-augmented generation via `/generate/research`
- Web search API: standalone `/search/web` endpoint for raw results
- React chat UI: streaming, conversation history, copy controls, research-mode toggle
- Multi-language: Python, JavaScript, Java, C++, TypeScript, SQL
- API key auth + rate limiting: production-ready security middleware
- Docker: multi-stage production image with frontend baked in
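
A minimal streaming client in Python, assuming standard `data:` SSE framing (the exact event format and terminator are assumptions; check the server's actual stream output):

```python
import requests

# Stream tokens from /generate/stream as they are produced.
resp = requests.post(
    "http://localhost:7860/generate/stream",
    json={"prompt": "Write a Python function to reverse a linked list"},
    stream=True,
)
for line in resp.iter_lines(decode_unicode=True):
    if line and line.startswith("data:"):
        chunk = line[len("data:"):].strip()
        if chunk == "[DONE]":  # common SSE terminator; this server's may differ
            break
        print(chunk, end="", flush=True)
```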
## Prerequisites

- Python 3.11+
- Node.js 20+ (for frontend dev/builds only)
- `blitzkode.gguf` at repo root (or set `BLITZKODE_MODEL_PATH`)
- 4 GB+ RAM
## Quick Start

```bash
pip install -r requirements.txt
python server.py
# Open http://localhost:7860
```
## Frontend Development

```bash
cd frontend
npm install
npm run dev   # http://localhost:5173; proxies /generate and /health to :7860
```
## Production Frontend Build

```bash
cd frontend && npm install && npm run build && cd ..
python server.py
```

FastAPI serves `frontend/dist/index.html` and `/assets/*` from the same port.
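
One way this single-port setup is typically wired in FastAPI (a sketch of the pattern, not necessarily how `server.py` implements it):

```python
from fastapi import FastAPI
from fastapi.responses import FileResponse
from fastapi.staticfiles import StaticFiles

app = FastAPI()

# Serve the built bundle's assets, and fall back to index.html for the UI.
app.mount("/assets", StaticFiles(directory="frontend/dist/assets"), name="assets")

@app.get("/")
def index() -> FileResponse:
    return FileResponse("frontend/dist/index.html")
```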
## Docker

```bash
# CPU
docker build -t blitzkode .
docker run -p 7860:7860 -v ./blitzkode.gguf:/app/blitzkode.gguf blitzkode

# GPU (with nvidia-docker)
docker compose --profile gpu up
```
## API Examples

```bash
# Standard generation (streaming)
curl -X POST http://localhost:7860/generate/stream \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Write a Python function to reverse a linked list"}'

# Non-streaming
curl -X POST http://localhost:7860/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Binary search in Python","max_tokens":128}'

# Web search only
curl -X POST http://localhost:7860/search/web \
  -H "Content-Type: application/json" \
  -d '{"query":"FastAPI dependency injection","max_results":3}'

# Research-augmented generation (search → inject → answer)
curl -X POST http://localhost:7860/generate/research \
  -H "Content-Type: application/json" \
  -d '{"prompt":"How do I use async generators in Python 3.12?","deep_search":true}'

# Health / info
curl http://localhost:7860/health
curl http://localhost:7860/info
```
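
If `BLITZKODE_API_KEY` is set, requests need a bearer token. A minimal authenticated Python client (a sketch: the `Authorization: Bearer` header follows the env-var table's "optional bearer token" description, and the 429 behavior on rate limiting is an assumption):

```python
import os
import requests

API_KEY = os.environ.get("BLITZKODE_API_KEY", "")
headers = {"Authorization": f"Bearer {API_KEY}"} if API_KEY else {}

resp = requests.post(
    "http://localhost:7860/generate",
    json={"prompt": "Binary search in Python", "max_tokens": 128},
    headers=headers,
    timeout=60,
)
resp.raise_for_status()  # e.g. HTTP 429 if the per-IP limit (default 30/min) is hit
print(resp.json())
```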
## API Parameters

### Generation (`/generate`, `/generate/stream`)

| Parameter | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | required | User request |
| `messages` | array | `[]` | Conversation history (max 20) |
| `temperature` | float | 0.5 | Sampling randomness, 0.0–2.0 |
| `max_tokens` | int | 256 | Max generated tokens (cap 512) |
| `top_p` | float | 0.95 | Nucleus sampling threshold |
| `top_k` | int | 20 | Top-k sampling |
| `repeat_penalty` | float | 1.05 | Repetition penalty |
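
A request exercising most of these knobs, with a short conversation history (values illustrative; the `{"role", "content"}` message shape is an assumption based on common chat APIs):

```python
import requests

payload = {
    "prompt": "Now add type hints to that function",
    "messages": [  # prior turns, oldest first; the server keeps at most 20
        {"role": "user", "content": "Write a Python function to reverse a string"},
        {"role": "assistant", "content": "def reverse(s):\n    return s[::-1]"},
    ],
    "temperature": 0.5,
    "max_tokens": 256,  # server caps generation at 512 tokens
    "top_p": 0.95,
    "top_k": 20,
    "repeat_penalty": 1.05,
}
print(requests.post("http://localhost:7860/generate", json=payload).json())
```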
### Research (`/generate/research`)

Same as above, plus:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `search_query` | string | prompt | Override query for web search |
| `search_results` | int | 5 | Results to inject |
| `deep_search` | bool | `false` | Also search documentation/best-practices variants |
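
For example, overriding the search query while keeping a broader prompt (illustrative values):

```python
import requests

payload = {
    "prompt": "How do I use async generators in Python 3.12?",
    "search_query": "python 3.12 async generator changes",  # searched instead of the prompt
    "search_results": 3,  # snippets injected into the model's context
    "deep_search": True,  # also query documentation/best-practices variants
}
print(requests.post("http://localhost:7860/generate/research", json=payload).json())
```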
### Web search (`/search/web`)

| Parameter | Type | Default | Description |
|---|---|---|---|
| `query` | string | required | Search query |
| `max_results` | int | 5 | Results to return |
| `deep` | bool | `false` | Multi-variant deep search |
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `BLITZKODE_MODEL_PATH` | `blitzkode.gguf` | GGUF model path |
| `BLITZKODE_FRONTEND_PATH` | `frontend/dist/index.html` | Built frontend |
| `BLITZKODE_HOST` | `0.0.0.0` | Server bind address |
| `BLITZKODE_PORT` | `7860` | Server port |
| `BLITZKODE_GPU_LAYERS` | `0` | GPU layers for llama.cpp |
| `BLITZKODE_N_CTX` | `2048` | Context window |
| `BLITZKODE_THREADS` | auto | CPU worker threads |
| `BLITZKODE_BATCH` | `128` | Batch size |
| `BLITZKODE_MAX_PROMPT_LENGTH` | `4000` | Max prompt chars |
| `BLITZKODE_PRELOAD_MODEL` | `false` | Load model at startup |
| `BLITZKODE_CORS_ORIGINS` | `http://localhost:7860` | CORS origins |
| `BLITZKODE_API_KEY` | empty | Optional bearer token |
| `BLITZKODE_WEB_SEARCH` | `true` | Enable web search endpoints |
| `BLITZKODE_SEARCH_TIMEOUT` | `8` | Search HTTP timeout (s) |
| `BLITZKODE_MAX_SEARCH_RESULTS` | `5` | Max search results |
| `BLITZKODE_RATE_LIMIT` | `true` | Enable per-IP rate limiting |
| `BLITZKODE_RATE_LIMIT_MAX` | `30` | Requests per IP per minute |
| `BLITZKODE_MAX_REQUEST_BYTES` | `50000` | Request body size limit |
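
As a concrete example, overriding a few of these from Python before launching the server (values illustrative; all settings are passed as strings):

```python
import os
import subprocess

# Offload 20 layers to the GPU, preload the model, widen the context window.
env = {
    **os.environ,
    "BLITZKODE_GPU_LAYERS": "20",
    "BLITZKODE_PRELOAD_MODEL": "true",
    "BLITZKODE_N_CTX": "4096",
}
subprocess.run(["python", "server.py"], env=env, check=True)
```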
## Training Pipeline

BlitzKode was fine-tuned through a staged pipeline on an RTX 4060 (8 GB VRAM):

| Stage | Script | Details |
|---|---|---|
| SFT v1 | `train_sft.py` | LoRA r=32 on 24 curated coding examples |
| Reward-SFT | `train_reward_sft.py` | Reward-heuristic continuation |
| DPO | `train_dpo.py` | 10 chosen/rejected preference pairs |
| SFT v2 | `train_available.py` | LoRA r=16, 100 steps, 99 samples (1.5B) |
| Export | `export_production.py` | Merge → GGUF Q8_0 via llama.cpp |
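
Roughly, the final adapter setup as a PEFT sketch (r and alpha come from the tech-stack table; the target modules are a typical choice for Qwen-style attention and are an assumption, with the scripts above as the source of truth):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

# r=16, alpha=32 per the tech-stack table; target_modules is an assumption.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapter weights train
```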
### Re-train from scratch

```bash
pip install -r requirements-training.txt

# Build dataset
python scripts/build_full_dataset.py

# Train 1.5B LoRA (100 steps, ~5 min on RTX 4060)
python scripts/train_available.py \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --quantization none \
  --dataset datasets/raw/blitzkode_full_training.json \
  --max-steps 100 --seq-len 384 --batch-size 1 --grad-accum 8

# Export: merge + GGUF
python scripts/export_production.py
```
### Push to HuggingFace

```bash
export HF_TOKEN=hf_XXXX  # get from https://huggingface.co/settings/tokens
python scripts/push_all_to_hub.py
```

This uploads:

- `checkpoints/blitzkode-1.5b-lora/final` → `neuralbroker/blitzkode-1.5b-lora`
- `checkpoints/available-lora-0.5b-full/final` → `neuralbroker/blitzkode-lora-0.5b`
- `blitzkode.gguf` → `neuralbroker/blitzkode`
## Project Structure

```
BlitzKode/
  server.py                  FastAPI backend
  blitzkode.gguf             Local GGUF model (ignored by git)
  frontend/                  React/Vite web UI
    src/App.jsx              Chat UI with streaming + research toggle
    src/index.css
    vite.config.js
  scripts/
    train_available.py       Resource-aware LoRA training
    build_full_dataset.py    Dataset builder
    export_production.py     Merge LoRA → GGUF
    push_to_hub.py           Single-adapter HF push
    push_all_to_hub.py       Push all artifacts in one command
    test_inference.py        Adapter smoke test
    healthcheck.sh           Docker/Compose health probe
  tests/test_server.py       20 backend endpoint tests (all passing)
  datasets/MANIFEST.md       Dataset provenance
  docs/PROJECT_OVERVIEW.md   Architecture & roadmap
  Dockerfile                 Multi-stage production image
  docker-compose.yml         CPU + GPU service definitions
  requirements.txt           Serving dependencies
  requirements-training.txt  Training dependencies (pinned)
```
## CI

```bash
python -m pytest tests/ -v       # 20 tests, all pass
python -m ruff check .           # lint
npm --prefix frontend run build  # frontend build
```
## License

MIT. See LICENSE. Also comply with the Qwen2.5 upstream license when redistributing model weights.
Created by Sajad (neuralbroker)