# BlitzKode

BlitzKode is a local AI coding assistant powered by a fine-tuned Qwen2.5-1.5B-Instruct model. It runs entirely on your machine: no external API calls, no data leaving your device.

## Tech Stack
| Layer | Tech |
|---|---|
| Base model | Qwen2.5-1.5B-Instruct |
| Fine-tuning | LoRA (r=16, α=32) via PEFT |
| Training | HuggingFace Transformers + TRL |
| Inference | llama-cpp-python (GGUF Q8_0) |
| Backend | Python 3.11+, FastAPI, uvicorn |
| Frontend | React 18, Vite, Phosphor Icons |
## Features

- Local-first: inference with the bundled GGUF, no cloud dependency
- Real-time streaming: SSE token-by-token via `/generate/stream` (see the client sketch after this list)
- Web research mode: DuckDuckGo search → context-augmented generation via `/generate/research`
- Web search API: standalone `/search/web` endpoint for raw results
- React chat UI: streaming, conversation history, copy controls, research-mode toggle
- Multi-language: Python, JavaScript, Java, C++, TypeScript, SQL
- API key auth + rate limiting: production-ready security middleware
- Docker: multi-stage production image with frontend baked in
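
A minimal streaming client in Python, assuming standard `data:` SSE framing (the exact event format and terminator are assumptions; check the server's actual stream output):

```python
import requests

# Stream tokens from /generate/stream as they are produced.
resp = requests.post(
    "http://localhost:7860/generate/stream",
    json={"prompt": "Write a Python function to reverse a linked list"},
    stream=True,
)
for line in resp.iter_lines(decode_unicode=True):
    if line and line.startswith("data:"):
        chunk = line[len("data:"):].strip()
        if chunk == "[DONE]":  # common SSE terminator; this server's may differ
            break
        print(chunk, end="", flush=True)
```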
## Prerequisites

- Python 3.11+
- Node.js 20+ (for frontend dev/builds only)
- `blitzkode.gguf` at repo root (or set `BLITZKODE_MODEL_PATH`)
- 4 GB+ RAM
## Quick Start

```bash
pip install -r requirements.txt
python server.py
# Open http://localhost:7860
```
## Frontend Development

```bash
cd frontend
npm install
npm run dev   # http://localhost:5173; proxies /generate and /health to :7860
```
## Production Frontend Build

```bash
cd frontend && npm install && npm run build && cd ..
python server.py
```

FastAPI serves `frontend/dist/index.html` and `/assets/*` from the same port.
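
One way this single-port setup is typically wired in FastAPI (a sketch of the pattern, not necessarily how `server.py` implements it):

```python
from fastapi import FastAPI
from fastapi.responses import FileResponse
from fastapi.staticfiles import StaticFiles

app = FastAPI()

# Serve the built bundle's assets, and fall back to index.html for the UI.
app.mount("/assets", StaticFiles(directory="frontend/dist/assets"), name="assets")

@app.get("/")
def index() -> FileResponse:
    return FileResponse("frontend/dist/index.html")
```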
## Docker

```bash
# CPU
docker build -t blitzkode .
docker run -p 7860:7860 -v ./blitzkode.gguf:/app/blitzkode.gguf blitzkode

# GPU (with nvidia-docker)
docker compose --profile gpu up
```
## API Examples

```bash
# Standard generation (streaming)
curl -X POST http://localhost:7860/generate/stream \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Write a Python function to reverse a linked list"}'

# Non-streaming
curl -X POST http://localhost:7860/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Binary search in Python","max_tokens":128}'

# Web search only
curl -X POST http://localhost:7860/search/web \
  -H "Content-Type: application/json" \
  -d '{"query":"FastAPI dependency injection","max_results":3}'

# Research-augmented generation (search → inject → answer)
curl -X POST http://localhost:7860/generate/research \
  -H "Content-Type: application/json" \
  -d '{"prompt":"How do I use async generators in Python 3.12?","deep_search":true}'

# Health / info
curl http://localhost:7860/health
curl http://localhost:7860/info
```
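
If `BLITZKODE_API_KEY` is set, requests need a bearer token. A minimal authenticated Python client (a sketch: the `Authorization: Bearer` header follows the env-var table's "optional bearer token" description, and the 429 behavior on rate limiting is an assumption):

```python
import os
import requests

API_KEY = os.environ.get("BLITZKODE_API_KEY", "")
headers = {"Authorization": f"Bearer {API_KEY}"} if API_KEY else {}

resp = requests.post(
    "http://localhost:7860/generate",
    json={"prompt": "Binary search in Python", "max_tokens": 128},
    headers=headers,
    timeout=60,
)
resp.raise_for_status()  # e.g. HTTP 429 if the per-IP limit (default 30/min) is hit
print(resp.json())
```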
## API Parameters

### Generation (`/generate`, `/generate/stream`)

| Parameter | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | required | User request |
| `messages` | array | `[]` | Conversation history (max 20) |
| `temperature` | float | 0.5 | Sampling randomness, 0.0–2.0 |
| `max_tokens` | int | 256 | Max generated tokens (cap 512) |
| `top_p` | float | 0.95 | Nucleus sampling threshold |
| `top_k` | int | 20 | Top-k sampling |
| `repeat_penalty` | float | 1.05 | Repetition penalty |
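
A request exercising most of these knobs, with a short conversation history (values illustrative; the `{"role", "content"}` message shape is an assumption based on common chat APIs):

```python
import requests

payload = {
    "prompt": "Now add type hints to that function",
    "messages": [  # prior turns, oldest first; the server keeps at most 20
        {"role": "user", "content": "Write a Python function to reverse a string"},
        {"role": "assistant", "content": "def reverse(s):\n    return s[::-1]"},
    ],
    "temperature": 0.5,
    "max_tokens": 256,  # server caps generation at 512 tokens
    "top_p": 0.95,
    "top_k": 20,
    "repeat_penalty": 1.05,
}
print(requests.post("http://localhost:7860/generate", json=payload).json())
```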
### Research (`/generate/research`)

Same as above, plus:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `search_query` | string | prompt | Override query for web search |
| `search_results` | int | 5 | Results to inject |
| `deep_search` | bool | `false` | Also search documentation/best-practices variants |
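
For example, overriding the search query while keeping a broader prompt (illustrative values):

```python
import requests

payload = {
    "prompt": "How do I use async generators in Python 3.12?",
    "search_query": "python 3.12 async generator changes",  # searched instead of the prompt
    "search_results": 3,  # snippets injected into the model's context
    "deep_search": True,  # also query documentation/best-practices variants
}
print(requests.post("http://localhost:7860/generate/research", json=payload).json())
```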
### Web search (`/search/web`)

| Parameter | Type | Default | Description |
|---|---|---|---|
| `query` | string | required | Search query |
| `max_results` | int | 5 | Results to return |
| `deep` | bool | `false` | Multi-variant deep search |
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `BLITZKODE_MODEL_PATH` | `blitzkode.gguf` | GGUF model path |
| `BLITZKODE_FRONTEND_PATH` | `frontend/dist/index.html` | Built frontend |
| `BLITZKODE_HOST` | `0.0.0.0` | Server bind address |
| `BLITZKODE_PORT` | `7860` | Server port |
| `BLITZKODE_GPU_LAYERS` | `0` | GPU layers for llama.cpp |
| `BLITZKODE_N_CTX` | `2048` | Context window |
| `BLITZKODE_THREADS` | auto | CPU worker threads |
| `BLITZKODE_BATCH` | `128` | Batch size |
| `BLITZKODE_MAX_PROMPT_LENGTH` | `4000` | Max prompt chars |
| `BLITZKODE_PRELOAD_MODEL` | `false` | Load model at startup |
| `BLITZKODE_CORS_ORIGINS` | `http://localhost:7860` | CORS origins |
| `BLITZKODE_API_KEY` | empty | Optional bearer token |
| `BLITZKODE_WEB_SEARCH` | `true` | Enable web search endpoints |
| `BLITZKODE_SEARCH_TIMEOUT` | `8` | Search HTTP timeout (s) |
| `BLITZKODE_MAX_SEARCH_RESULTS` | `5` | Max search results |
| `BLITZKODE_RATE_LIMIT` | `true` | Enable per-IP rate limiting |
| `BLITZKODE_RATE_LIMIT_MAX` | `30` | Requests per IP per minute |
| `BLITZKODE_MAX_REQUEST_BYTES` | `50000` | Request body size limit |
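
As a concrete example, overriding a few of these from Python before launching the server (values illustrative; all settings are passed as strings):

```python
import os
import subprocess

# Offload 20 layers to the GPU, preload the model, widen the context window.
env = {
    **os.environ,
    "BLITZKODE_GPU_LAYERS": "20",
    "BLITZKODE_PRELOAD_MODEL": "true",
    "BLITZKODE_N_CTX": "4096",
}
subprocess.run(["python", "server.py"], env=env, check=True)
```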
## Training Pipeline

BlitzKode was fine-tuned through a staged pipeline on an RTX 4060 (8 GB VRAM):

| Stage | Script | Details |
|---|---|---|
| SFT v1 | `train_sft.py` | LoRA r=32 on 24 curated coding examples |
| Reward-SFT | `train_reward_sft.py` | Reward-heuristic continuation |
| DPO | `train_dpo.py` | 10 chosen/rejected preference pairs |
| SFT v2 | `train_available.py` | LoRA r=16, 100 steps, 99 samples (1.5B) |
| Export | `export_production.py` | Merge → GGUF Q8_0 via llama.cpp |
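
Roughly, the final adapter setup as a PEFT sketch (r and alpha come from the tech-stack table; the target modules are a typical choice for Qwen-style attention and are an assumption, with the scripts above as the source of truth):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

# r=16, alpha=32 per the tech-stack table; target_modules is an assumption.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapter weights train
```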
### Re-train from scratch

```bash
pip install -r requirements-training.txt

# Build dataset
python scripts/build_full_dataset.py

# Train 1.5B LoRA (100 steps, ~5 min on RTX 4060)
python scripts/train_available.py \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --quantization none \
  --dataset datasets/raw/blitzkode_full_training.json \
  --max-steps 100 --seq-len 384 --batch-size 1 --grad-accum 8

# Export: merge + GGUF
python scripts/export_production.py
```
### Push to HuggingFace

```bash
export HF_TOKEN=hf_XXXX  # get from https://huggingface.co/settings/tokens
python scripts/push_all_to_hub.py
```

This uploads:

- `checkpoints/blitzkode-1.5b-lora/final` → `neuralbroker/blitzkode-1.5b-lora`
- `checkpoints/available-lora-0.5b-full/final` → `neuralbroker/blitzkode-lora-0.5b`
- `blitzkode.gguf` → `neuralbroker/blitzkode`
## Project Structure

```
BlitzKode/
  server.py                  FastAPI backend
  blitzkode.gguf             Local GGUF model (ignored by git)
  frontend/                  React/Vite web UI
    src/App.jsx              Chat UI with streaming + research toggle
    src/index.css
    vite.config.js
  scripts/
    train_available.py       Resource-aware LoRA training
    build_full_dataset.py    Dataset builder
    export_production.py     Merge LoRA → GGUF
    push_to_hub.py           Single-adapter HF push
    push_all_to_hub.py       Push all artifacts in one command
    test_inference.py        Adapter smoke test
    healthcheck.sh           Docker/Compose health probe
  tests/test_server.py       20 backend endpoint tests (all passing)
  datasets/MANIFEST.md       Dataset provenance
  docs/PROJECT_OVERVIEW.md   Architecture & roadmap
  Dockerfile                 Multi-stage production image
  docker-compose.yml         CPU + GPU service definitions
  requirements.txt           Serving dependencies
  requirements-training.txt  Training dependencies (pinned)
```
## CI

```bash
python -m pytest tests/ -v       # 20 tests, all pass
python -m ruff check .           # lint
npm --prefix frontend run build  # frontend build
```
## License

MIT. See LICENSE. Also comply with the Qwen2.5 upstream license when redistributing model weights.
Created by Sajad (neuralbroker)