NICoLE-LLM
NICoLE is a compact LLM-based controller for congestion-aware RTP/WebRTC adaptive video streaming.
From RTP packetization and queue telemetry, encoded as a compact symbolic prompt, it predicts:
- ECN
- Current Profile (CP)
- Next Profile (NP)
Optimized for:
- low-latency inference
- edge deployment
- GGUF quantization
- deterministic structured outputs
Applications:
- WebRTC adaptive streaming
- congestion-aware real-time video encoding adaptation
- in-network QoE optimization
- edge AI networking
Profiles
| Profile | Resolution | FPS | GoP |
|---|---|---|---|
| P0 | 3840×2160 (4K) | 30 / 60 / 90 / 120 | 2 s |
| P1 | 1920×1080 | 30 / 60 / 90 / 120 | 2 s |
| P2 | 1280×720 | 30 / 60 / 90 / 120 | 2 s |
| P3 | 640×360 | 30 / 60 / 90 / 120 | 2 s |
The dataset was generated using real-time WebRTC streaming under a 40 Mbps bottleneck shared between background traffic and adaptive RTP video streaming.
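The profile table above can be kept as a small lookup structure on the controller side. The sketch below is illustrative only: the names `PROFILES`, `SUPPORTED_FPS`, `GOP_SECONDS`, and `resolution_for` are hypothetical, and it assumes that a numeric profile value `k` refers to row `Pk` of the table.

```python
# Lookup table mirroring the Profiles table.
# Assumption: profile value k corresponds to row "Pk".
PROFILES = {
    0: (3840, 2160),  # P0, 4K
    1: (1920, 1080),  # P1
    2: (1280, 720),   # P2
    3: (640, 360),    # P3
}

# Frame rates and GoP length are the same for every profile in the table.
SUPPORTED_FPS = (30, 60, 90, 120)
GOP_SECONDS = 2


def resolution_for(profile_index: int) -> str:
    """Return the 'WIDTHxHEIGHT' string for a profile index (hypothetical helper)."""
    if profile_index not in PROFILES:
        raise ValueError(f"unknown profile index: {profile_index}")
    width, height = PROFILES[profile_index]
    return f"{width}x{height}"
```

For example, `resolution_for(1)` yields the P1 resolution, which an encoder wrapper could pass on as its target frame size.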
Prompt Format
Input order:
```text
PS FS IFGS IFGR CQ LQ E
```
Output order:
```text
E C N
```
Example:
```text
I:PS FS IFGS IFGR CQ LQ E
O:E C N
U:1400,40,34,33,2,0,0
A:
```
Expected output:
```text
0,1,1
```
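The prompt format above lends itself to a pair of small helpers: one that formats a telemetry sample and one that parses the answer. This is a minimal sketch, not part of the model's tooling. The names `build_prompt`, `parse_decision`, and `Decision` are hypothetical, and the mapping of the output fields `E C N` to ECN, Current Profile, and Next Profile is an assumption based on the list of predicted values.

```python
from typing import NamedTuple

# Header lines copied from the documented prompt format.
HEADER = "I:PS FS IFGS IFGR CQ LQ E\nO:E C N"


class Decision(NamedTuple):
    ecn: int              # output field E (assumed: ECN)
    current_profile: int  # output field C (assumed: Current Profile)
    next_profile: int     # output field N (assumed: Next Profile)


def build_prompt(ps, fs, ifgs, ifgr, cq, lq, e) -> str:
    """Format one sample in the documented order: PS FS IFGS IFGR CQ LQ E."""
    values = ",".join(str(v) for v in (ps, fs, ifgs, ifgr, cq, lq, e))
    return f"{HEADER}\nU:{values}\nA:"


def parse_decision(text: str) -> Decision:
    """Parse the first line of an 'E,C,N' answer; raise ValueError otherwise."""
    lines = text.strip().splitlines()
    if not lines:
        raise ValueError("empty model output")
    parts = lines[0].split(",")
    if len(parts) != 3:
        raise ValueError(f"expected three comma-separated values, got: {lines[0]!r}")
    return Decision(*(int(p.strip()) for p in parts))
```

Rejecting malformed answers with an exception, rather than guessing, keeps a control loop from acting on a truncated or off-format generation.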
Hugging Face Usage (Python code)
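```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace with the repository ID that hosts the model weights.
model_id = "YOUR_USERNAME/NICoLE-LLM"

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto"
)

# Compact symbolic prompt: input header, output header, sample, answer slot.
prompt = """I:PS FS IFGS IFGR CQ LQ E
O:E C N
U:1400,40,34,33,2,0,0
A:"""

inputs = tok(prompt, return_tensors="pt").to(model.device)

# Greedy decoding (do_sample=False) keeps the output deterministic.
out = model.generate(
    **inputs,
    max_new_tokens=6,
    do_sample=False
)

# The decoded text contains the prompt followed by the generated answer.
print(tok.decode(out[0], skip_special_tokens=True))
```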
GGUF / llama.cpp Usage
```bash
./llama-cli \
  -no-cnv \
  -t 4 \
  -m nicole-q4.gguf \
  -p "I:PS FS IFGS IFGR CQ LQ E
O:E C N
U:1400,40,34,33,2,0,0
A:" \
  -n 6 \
  --temp 0 \
  --top-k 1
```
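To call the same command from a Python control loop, one option is a thin `subprocess` wrapper. The sketch below only reuses the flags shown in the command above; the function names `llama_cli_args` and `run_llama_cli`, the default binary path, and the timeout value are assumptions for illustration.

```python
import subprocess


def llama_cli_args(model_path, prompt, threads=4, binary="./llama-cli"):
    """Build the argument list for the llama-cli invocation shown above."""
    return [
        binary,
        "-no-cnv",
        "-t", str(threads),
        "-m", model_path,
        "-p", prompt,
        "-n", "6",
        "--temp", "0",
        "--top-k", "1",
    ]


def run_llama_cli(model_path, prompt, threads=4, timeout_s=10.0):
    """Run llama-cli once and return its raw stdout (hypothetical helper).

    Raises subprocess.CalledProcessError on a non-zero exit status and
    subprocess.TimeoutExpired if the call exceeds timeout_s seconds.
    """
    result = subprocess.run(
        llama_cli_args(model_path, prompt, threads),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with the multi-line prompt. The returned text may include the echoed prompt, so the answer still has to be extracted from it.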
Runtime Configuration
| Parameter | Value |
|---|---|
| Runtime | llama.cpp |
| Quantization | Q4_K_M |
| Model Size | 636 MB |
| Context Length | 4096 |
| Inference | Deterministic |
| Prompting | Compact Symbolic |
CPU Core Benchmark
| Threads | Response (ms) | Decisions/sec | Tokens/sec |
|---|---|---|---|
| 1 | 1325 | 0.75 | 52.71 |
| 2 | 624 | 1.60 | 113.30 |
| 4 | 343 | 2.91 | 203.35 |
| 8 | 904 | 1.11 | 60.70 |
| 16 | 1043 | 0.96 | 132.46 |
| 32 | 1432 | 0.70 | 104.29 |
Best CPU deployment:
- 4 threads
- 343 ms response time
- 2.91 decisions/sec
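The Decisions/sec column appears to be the reciprocal of the response time, that is, 1000 divided by the response time in milliseconds. The sketch below recomputes it from the table under that assumption and picks the thread count with the lowest response time. The names `RESPONSE_MS` and `decisions_per_second` are illustrative, and small last-digit differences from the table are expected because the published response times are rounded.

```python
# Response time in milliseconds per thread count, copied from the benchmark table.
RESPONSE_MS = {1: 1325, 2: 624, 4: 343, 8: 904, 16: 1043, 32: 1432}


def decisions_per_second(response_ms: float) -> float:
    """Assumed relation: one decision per response, so rate = 1000 / ms."""
    return 1000.0 / response_ms


# The best configuration is the thread count with the lowest response time.
best_threads = min(RESPONSE_MS, key=RESPONSE_MS.get)

for threads, ms in RESPONSE_MS.items():
    marker = "  <- best" if threads == best_threads else ""
    print(f"{threads:>2} threads: {decisions_per_second(ms):.2f} decisions/sec{marker}")
```

Because throughput falls again beyond 4 threads in this table, it is worth benchmarking the thread count on the target hardware rather than defaulting to all available cores.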
Compared to verbose natural-language prompting, compact symbolic prompting reduces:
- prompt tokens
- KV-cache usage
- inference latency
- deployment overhead
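A rough way to see the difference in prompt size is to compare the compact prompt with a natural-language equivalent. In the sketch below the verbose prompt is a made-up example, not one used in the NICoLE work, and character count is only a crude proxy for token count. An exact figure would require the model's tokenizer.

```python
# Compact prompt in the documented format.
compact_prompt = (
    "I:PS FS IFGS IFGR CQ LQ E\n"
    "O:E C N\n"
    "U:1400,40,34,33,2,0,0\n"
    "A:"
)

# Hypothetical verbose equivalent, written only for this comparison.
verbose_prompt = (
    "You are a network controller for RTP video streaming. "
    "Given the telemetry values PS=1400, FS=40, IFGS=34, IFGR=33, "
    "CQ=2, LQ=0 and E=0, report the ECN flag, the current profile, "
    "and the next profile as three comma-separated integers."
)


def size_ratio(short: str, long: str) -> float:
    """Return len(long) / len(short) as a crude prompt-size ratio."""
    return len(long) / len(short)


print(f"compact: {len(compact_prompt)} chars")
print(f"verbose: {len(verbose_prompt)} chars")
print(f"ratio:   {size_ratio(compact_prompt, verbose_prompt):.1f}x")
```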
Quantized Models
Available quantization:
- Q4_K_M (recommended)
Runtime:
- llama.cpp
Designed for:
- edge deployment
- CPU inference
- bounded symbolic control inference
- real-time congestion-aware adaptation
Limitations
- Trained under a 40 Mbps bottleneck scenario
- Designed for bounded RTP/WebRTC streaming tasks
- Not intended for open-ended conversational generation
Citation
If you use this model, please cite the NICoLE paper and repository.
- Alireza Shirmarz, Fabio Luciano Verdi, Gyanesh Patra, Gergely Pongracz, "NICoLE: Are In-Network LLM-Based Agents Cost-Feasible for RTP Video Streaming?", IEEE/IFIP Networking, Switzerland, 2026.