BlitzKode

BlitzKode is a local AI coding assistant powered by a fine-tuned Qwen2.5-1.5B-Instruct model. It runs entirely on your machine: no external API calls, no data leaving your device.

Tech Stack

| Layer | Tech |
|---|---|
| Base model | Qwen2.5-1.5B-Instruct |
| Fine-tuning | LoRA (r=16, α=32) via PEFT |
| Training | HuggingFace Transformers + TRL |
| Inference | llama-cpp-python (GGUF Q8_0) |
| Backend | Python 3.11+, FastAPI, uvicorn |
| Frontend | React 18, Vite, Phosphor Icons |

Features

  • Local-first: inference with the bundled GGUF, no cloud dependency
  • Real-time streaming: SSE token-by-token via /generate/stream
  • Web research mode: DuckDuckGo search → context-augmented generation via /generate/research
  • Web search API: standalone /search/web endpoint for raw results
  • React chat UI: streaming, conversation history, copy controls, research-mode toggle
  • Multi-language: Python, JavaScript, Java, C++, TypeScript, SQL
  • API key auth + rate limiting: production-ready security middleware
  • Docker: multi-stage production image with frontend baked in

Prerequisites

  • Python 3.11+
  • Node.js 20+ (for frontend dev/builds only)
  • blitzkode.gguf at repo root (or set BLITZKODE_MODEL_PATH)
  • 4 GB+ RAM

Quick Start

pip install -r requirements.txt
python server.py
# Open http://localhost:7860
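
To poke at the model directly without the server, here is a minimal llama-cpp-python sketch. This is an illustration only: the loading arguments mirror the documented defaults in the Environment Variables table below, and server.py's actual setup may differ.

from llama_cpp import Llama

# Load the bundled GGUF; values mirror the documented env-var defaults
llm = Llama(
    model_path="blitzkode.gguf",  # BLITZKODE_MODEL_PATH
    n_ctx=2048,                   # BLITZKODE_N_CTX
    n_gpu_layers=0,               # BLITZKODE_GPU_LAYERS
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function to reverse a string."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])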

Frontend Development

cd frontend
npm install
npm run dev      # http://localhost:5173; proxies /generate and /health to :7860

Production Frontend Build

cd frontend && npm install && npm run build && cd ..
python server.py

FastAPI serves frontend/dist/index.html and /assets/* from the same port.
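
The serving pattern looks roughly like the sketch below (illustrative only; server.py's actual wiring may differ):

from fastapi import FastAPI
from fastapi.responses import FileResponse
from fastapi.staticfiles import StaticFiles

app = FastAPI()

# Serve the built bundle's assets and the SPA entry point from one port
app.mount("/assets", StaticFiles(directory="frontend/dist/assets"), name="assets")

@app.get("/")
def index():
    return FileResponse("frontend/dist/index.html")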

Docker

# CPU
docker build -t blitzkode .
docker run -p 7860:7860 -v "$(pwd)/blitzkode.gguf:/app/blitzkode.gguf" blitzkode

# GPU (with nvidia-docker)
docker compose --profile gpu up

API Examples

# Standard generation (streaming)
curl -X POST http://localhost:7860/generate/stream \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Write a Python function to reverse a linked list"}'

# Non-streaming
curl -X POST http://localhost:7860/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Binary search in Python","max_tokens":128}'

# Web search only
curl -X POST http://localhost:7860/search/web \
  -H "Content-Type: application/json" \
  -d '{"query":"FastAPI dependency injection","max_results":3}'

# Research-augmented generation (search → inject → answer)
curl -X POST http://localhost:7860/generate/research \
  -H "Content-Type: application/json" \
  -d '{"prompt":"How do I use async generators in Python 3.12?","deep_search":true}'

# Health / info
curl http://localhost:7860/health
curl http://localhost:7860/info
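
For programmatic use, here is a minimal Python client for the streaming endpoint. The SSE payload shape is an assumption (each event is assumed to be a `data: ...` line carrying a token chunk); adjust to the actual wire format:

import requests

resp = requests.post(
    "http://localhost:7860/generate/stream",
    json={"prompt": "Write a Python function to reverse a linked list"},
    # headers={"Authorization": "Bearer <key>"},  # only if BLITZKODE_API_KEY is set
    stream=True,
)
resp.raise_for_status()
for line in resp.iter_lines(decode_unicode=True):
    if line.startswith("data: "):
        print(line[len("data: "):], end="", flush=True)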

API Parameters

Generation (/generate, /generate/stream)

| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | string | required | User request |
| messages | array | [] | Conversation history (max 20) |
| temperature | float | 0.5 | Sampling randomness, 0.0–2.0 |
| max_tokens | int | 256 | Max generated tokens (cap 512) |
| top_p | float | 0.95 | Nucleus sampling threshold |
| top_k | int | 20 | Top-k sampling |
| repeat_penalty | float | 1.05 | Repetition penalty |
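
A /generate request exercising all of these knobs (values are illustrative; the response is assumed to be JSON):

import requests

payload = {
    "prompt": "Binary search in Python",
    "messages": [],          # prior conversation turns, max 20
    "temperature": 0.5,
    "max_tokens": 256,       # server caps this at 512
    "top_p": 0.95,
    "top_k": 20,
    "repeat_penalty": 1.05,
}
print(requests.post("http://localhost:7860/generate", json=payload).json())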

Research (/generate/research)

Same as above, plus:

| Parameter | Type | Default | Description |
|---|---|---|---|
| search_query | string | prompt | Override query for web search |
| search_results | int | 5 | Results to inject |
| deep_search | bool | false | Also search documentation/best-practice variants |
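
Example: overriding the search query and result count (field names come from the table above; the response shape is assumed to be JSON):

import requests

payload = {
    "prompt": "How do I use async generators in Python 3.12?",
    "search_query": "python 3.12 async generator changes",  # used instead of the prompt
    "search_results": 3,
    "deep_search": False,
}
print(requests.post("http://localhost:7860/generate/research", json=payload).json())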

Web search (/search/web)

| Parameter | Type | Default | Description |
|---|---|---|---|
| query | string | required | Search query |
| max_results | int | 5 | Results to return |
| deep | bool | false | Multi-variant deep search |

Environment Variables

| Variable | Default | Description |
|---|---|---|
| BLITZKODE_MODEL_PATH | blitzkode.gguf | GGUF model path |
| BLITZKODE_FRONTEND_PATH | frontend/dist/index.html | Built frontend |
| BLITZKODE_HOST | 0.0.0.0 | Server bind address |
| BLITZKODE_PORT | 7860 | Server port |
| BLITZKODE_GPU_LAYERS | 0 | GPU layers for llama.cpp |
| BLITZKODE_N_CTX | 2048 | Context window |
| BLITZKODE_THREADS | auto | CPU worker threads |
| BLITZKODE_BATCH | 128 | Batch size |
| BLITZKODE_MAX_PROMPT_LENGTH | 4000 | Max prompt chars |
| BLITZKODE_PRELOAD_MODEL | false | Load model at startup |
| BLITZKODE_CORS_ORIGINS | http://localhost:7860 | CORS origins |
| BLITZKODE_API_KEY | empty | Optional bearer token |
| BLITZKODE_WEB_SEARCH | true | Enable web search endpoints |
| BLITZKODE_SEARCH_TIMEOUT | 8 | Search HTTP timeout (s) |
| BLITZKODE_MAX_SEARCH_RESULTS | 5 | Max search results |
| BLITZKODE_RATE_LIMIT | true | Enable per-IP rate limiting |
| BLITZKODE_RATE_LIMIT_MAX | 30 | Requests per IP per minute |
| BLITZKODE_MAX_REQUEST_BYTES | 50000 | Request body size limit |
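
How server.py presumably resolves this configuration (an illustration of the documented defaults, not the actual code):

import os

MODEL_PATH = os.getenv("BLITZKODE_MODEL_PATH", "blitzkode.gguf")
PORT = int(os.getenv("BLITZKODE_PORT", "7860"))
GPU_LAYERS = int(os.getenv("BLITZKODE_GPU_LAYERS", "0"))
API_KEY = os.getenv("BLITZKODE_API_KEY") or None  # auth is off when unset/empty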

Training Pipeline

BlitzKode was fine-tuned through a staged pipeline on an RTX 4060 (8 GB VRAM):

| Stage | Script | Details |
|---|---|---|
| SFT v1 | train_sft.py | LoRA r=32 on 24 curated coding examples |
| Reward-SFT | train_reward_sft.py | Reward-heuristic continuation |
| DPO | train_dpo.py | 10 chosen/rejected preference pairs |
| SFT v2 | train_available.py | LoRA r=16, 100 steps, 99 samples (1.5B) |
| Export | export_production.py | Merge → GGUF Q8_0 via llama.cpp |
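
The SFT v2 adapter configuration amounts to roughly this PEFT setup (a sketch: target modules and dropout are assumptions, not read from train_available.py):

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                 # rank used in SFT v2
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    lora_dropout=0.05,    # assumed
    task_type="CAUSAL_LM",
)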

Re-train from scratch

pip install -r requirements-training.txt

# Build dataset
python scripts/build_full_dataset.py

# Train 1.5B LoRA (100 steps, ~5 min on RTX 4060)
python scripts/train_available.py \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --quantization none \
  --dataset datasets/raw/blitzkode_full_training.json \
  --max-steps 100 --seq-len 384 --batch-size 1 --grad-accum 8

# Export: merge + GGUF
python scripts/export_production.py
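
The export step amounts to a LoRA merge followed by llama.cpp conversion, roughly as sketched below (paths and the convert invocation are assumptions; export_production.py is the supported path):

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
merged = PeftModel.from_pretrained(
    base, "checkpoints/blitzkode-1.5b-lora/final"
).merge_and_unload()  # fold the adapter into the base weights
merged.save_pretrained("merged/")
AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct").save_pretrained("merged/")
# then: python llama.cpp/convert_hf_to_gguf.py merged/ --outfile blitzkode.gguf --outtype q8_0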

Push to HuggingFace

export HF_TOKEN=hf_XXXX        # get from https://huggingface.co/settings/tokens
python scripts/push_all_to_hub.py

This uploads:

  • checkpoints/blitzkode-1.5b-lora/final → neuralbroker/blitzkode-1.5b-lora
  • checkpoints/available-lora-0.5b-full/final → neuralbroker/blitzkode-lora-0.5b
  • blitzkode.gguf → neuralbroker/blitzkode
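
A hypothetical manual equivalent with huggingface_hub, for reference (push_all_to_hub.py is the supported path):

from huggingface_hub import HfApi

api = HfApi()  # picks up HF_TOKEN from the environment
api.upload_folder(
    folder_path="checkpoints/blitzkode-1.5b-lora/final",
    repo_id="neuralbroker/blitzkode-1.5b-lora",
)
api.upload_file(
    path_or_fileobj="blitzkode.gguf",
    path_in_repo="blitzkode.gguf",
    repo_id="neuralbroker/blitzkode",
)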

Project Structure

BlitzKode/
  server.py                    FastAPI backend
  blitzkode.gguf               Local GGUF model (ignored by git)
  frontend/                    React/Vite web UI
    src/App.jsx                Chat UI with streaming + research toggle
    src/index.css
    vite.config.js
  scripts/
    train_available.py         Resource-aware LoRA training
    build_full_dataset.py      Dataset builder
    export_production.py       Merge LoRA → GGUF
    push_to_hub.py             Single-adapter HF push
    push_all_to_hub.py         Push all artifacts in one command
    test_inference.py          Adapter smoke test
    healthcheck.sh             Docker/Compose health probe
  tests/test_server.py         20 backend endpoint tests (all passing)
  datasets/MANIFEST.md         Dataset provenance
  docs/PROJECT_OVERVIEW.md     Architecture & roadmap
  Dockerfile                   Multi-stage production image
  docker-compose.yml           CPU + GPU service definitions
  requirements.txt             Serving dependencies
  requirements-training.txt    Training dependencies (pinned)

CI

python -m pytest tests/ -v          # 20 tests, all pass
python -m ruff check .              # lint
npm --prefix frontend run build     # frontend build

License

MIT. See LICENSE. When redistributing model weights, also comply with the upstream Qwen2.5 license.


Created by Sajad (neuralbroker)
