# Llama-3.1-70B-Instruct-NVFP4-W4A4
This is an NVFP4 (4-bit floating point) quantized version of meta-llama/Llama-3.1-70B-Instruct created using llm-compressor.
**Note:** This model quantizes weights and activations to FP4. The KV cache is NOT quantized and remains in bf16 (original precision).
## Quantization Details
- Quantization Method: NVFP4 W4A4 (weights and activations only)
- Weight Precision: FP4 (4-bit floating point)
  - Per-tensor global scale + per-group (size 16) local quantization scales (see the sketch below)
- Activation Precision: FP4 (4-bit floating point)
  - Per-tensor scales with dynamic local quantization
- KV Cache: bf16 (not quantized; remains in original precision)
- Quantization Scheme: NVFP4 (NVIDIA FP4 format)
- Ignored Layers: `lm_head` only
- Calibration Dataset: CNN/DailyMail
- Calibration Samples: 512
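The weight scheme above (a per-tensor global scale on top of per-group local scales over groups of 16 values) can be illustrated with a small fake-quantization sketch. This is a toy illustration only, not the llm-compressor implementation: the value grid is the standard FP4 E2M1 set, and the global-scale handling is simplified (real NVFP4 stores the local scales in FP8 E4M3).

```python
import torch

# Representable magnitudes of the FP4 E2M1 format (the sign is handled separately).
FP4_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_nvfp4(weight: torch.Tensor, group_size: int = 16) -> torch.Tensor:
    """Quantize-then-dequantize a 2D weight with a per-tensor global scale
    and per-group (size 16) local scales, in the spirit of NVFP4."""
    out_features, in_features = weight.shape
    assert in_features % group_size == 0, "in_features must be divisible by group_size"
    w = weight.float().reshape(out_features, in_features // group_size, group_size)

    # Local scale per group of 16: map each group's max magnitude onto the FP4 max (6.0).
    local_scale = w.abs().amax(dim=-1, keepdim=True) / 6.0

    # Per-tensor global scale; real NVFP4 stores the local scales in FP8 E4M3,
    # here we simply normalize them by the global scale for illustration.
    global_scale = local_scale.max().clamp(min=1e-12)
    local_scale = (local_scale / global_scale).clamp(min=2.0 ** -14)

    # Snap every scaled element to the nearest representable FP4 magnitude, keeping the sign.
    scaled = w / (local_scale * global_scale)
    idx = (scaled.abs().unsqueeze(-1) - FP4_VALUES).abs().argmin(dim=-1)
    q = FP4_VALUES[idx] * scaled.sign()

    # Dequantize back to the original range.
    return (q * local_scale * global_scale).reshape(out_features, in_features).to(weight.dtype)

# Example: mean absolute quantization error on a random weight matrix.
w = torch.randn(128, 256)
print((w - fake_quant_nvfp4(w)).abs().mean())
```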
## Model Size
- Original Model: ~140GB (bf16)
- Quantized Model: ~40GB (NVFP4 W4A4)
- Compression Ratio: ~3.5x
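As a rough back-of-envelope check of these numbers (assuming ~70.6B parameters, 4-bit weights, and one 8-bit local scale per group of 16 weights; embeddings, norms, global scales, and the unquantized `lm_head` are ignored):

```python
# Rough size estimate; treats every parameter as a quantized Linear weight.
params = 70.6e9                                 # approximate parameter count

bf16_bytes  = params * 2                        # 16 bits per weight
nvfp4_bytes = params * (4 + 8 / 16) / 8         # 4-bit weight + amortized 8-bit scale per 16 weights

print(f"bf16 : {bf16_bytes / 1e9:.0f} GB")      # ~141 GB
print(f"NVFP4: {nvfp4_bytes / 1e9:.0f} GB")     # ~40 GB
print(f"ratio: {bf16_bytes / nvfp4_bytes:.1f}x")  # ~3.6x in theory; unquantized layers bring it closer to ~3.5x
```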
## Usage

### Installation

```bash
pip install "vllm>=0.6.0"
```
### With vLLM

```python
from vllm import LLM, SamplingParams

# Load the NVFP4 W4A4 quantized model. vLLM picks up the compressed-tensors
# quantization config stored in the checkpoint, so an explicit quantization
# argument should not be needed.
llm = LLM(model="JongYeop/Llama-3.1-70B-Instruct-NVFP4-W4A4")

# Generate text
prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```
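Since this is an Instruct model, prompts generally work best when rendered through the Llama 3.1 chat template. A minimal sketch, reusing the `llm` and `sampling_params` objects from the example above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("JongYeop/Llama-3.1-70B-Instruct-NVFP4-W4A4")
messages = [{"role": "user", "content": "Summarize NVFP4 quantization in one sentence."}]

# Render the chat template to a plain string and let vLLM handle tokenization.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```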
### With Transformers (for inspection)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("JongYeop/Llama-3.1-70B-Instruct-NVFP4-W4A4")
model = AutoModelForCausalLM.from_pretrained(
    "JongYeop/Llama-3.1-70B-Instruct-NVFP4-W4A4",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
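To inspect the quantization metadata without loading the full 70B weights, the checkpoint's config can be read directly. A small sketch, assuming the `quantization_config` entry in `config.json` is where this compressed-tensors checkpoint records its scheme, group size, and ignored modules:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("JongYeop/Llama-3.1-70B-Instruct-NVFP4-W4A4")
print(getattr(config, "quantization_config", None))  # compressed-tensors quantization metadata
```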
## Performance

NVFP4 W4A4 quantization provides:
- ~3.5x memory reduction compared to bf16
- Faster inference on hardware with native FP4 support (e.g., NVIDIA B200 and other Blackwell GPUs)
- Higher compression than FP8 while maintaining good accuracy
- Efficient group-wise quantization for weights (group size: 16)
## Hardware Requirements

- GPU: NVIDIA GPU with compute capability >= 8.9 (Ada Lovelace, Hopper, Blackwell); see the check below
  - Examples: RTX 4090, L40S, H100, H200, B200
- VRAM: minimum 48GB for inference (e.g., a single H100 80GB)
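A quick way to check the compute capability of the local GPU before loading the model:

```python
import torch

# Compute capability of GPU 0; 8.9 = Ada Lovelace, 9.0 = Hopper, 10.0+ = Blackwell.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
if (major, minor) < (8, 9):
    print("Warning: this GPU is likely too old for the FP8/FP4 kernels used by NVFP4.")
```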
## Why NVFP4 W4A4?
NVFP4 (NVIDIA FP4) is a 4-bit floating-point format that provides:
- Better dynamic range than INT4 quantization
- Higher compression than FP8 (4-bit vs 8-bit)
- Group-wise quantization for weights to preserve accuracy
- Per-tensor dynamic quantization for activations
KV Cache remains in bf16 because:
- Preserves generation quality for long contexts
- Reduces potential accuracy degradation in multi-turn conversations
- Allows easier comparison with other quantization methods
## Quantization Recipe

The quantization recipe used for this model is included in the repository as `recipe.yaml`. Key configuration:

```yaml
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: ["lm_head"]
      scheme: "NVFP4"  # NVIDIA FP4 format
      targets: ["Linear"]
```
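For reference, the recipe above maps onto llm-compressor's one-shot workflow roughly as follows. This is a hedged sketch based on llm-compressor's documented examples, not the exact script used to produce this checkpoint; the calibration arguments (dataset name, sequence length) are assumptions drawn from the details listed earlier.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Same settings as recipe.yaml: quantize Linear layers to NVFP4, skip lm_head.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model="meta-llama/Llama-3.1-70B-Instruct",
    recipe=recipe,
    dataset="cnn_dailymail",        # calibration dataset (see Quantization Details)
    num_calibration_samples=512,    # calibration samples (see Quantization Details)
    max_seq_length=2048,            # assumed calibration sequence length
    output_dir="Llama-3.1-70B-Instruct-NVFP4-W4A4",
)
```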
## Comparison with Other Formats
| Format | Model Size | Compression | KV Cache | Notes |
|---|---|---|---|---|
| BF16 (Original) | ~140GB | 1.0x | bf16 | Full precision |
| FP8 W8A8 | ~70GB | 2.0x | bf16 | Good balance |
| NVFP4 W4A4 | ~40GB | 3.5x | bf16 | Higher compression |
| FP8 W8A8+KV | ~65GB | 2.2x | fp8 | Full quantization |
## Citation

If you use this model, please cite:

```bibtex
@software{llm-compressor,
  title  = {LLM Compressor},
  author = {vLLM Team},
  url    = {https://github.com/vllm-project/llm-compressor},
  year   = {2024}
}

@article{llama3,
  title  = {Llama 3 Model Card},
  author = {AI@Meta},
  year   = {2024},
  url    = {https://github.com/meta-llama/llama3}
}
```
## Reference Models
- Original model: meta-llama/Llama-3.1-70B-Instruct
- NVIDIA's NVFP4 example: nvidia/Llama-3.1-8B-Instruct-NVFP4
## License
This model inherits the license from the original Llama 3.1 model.
## Acknowledgments
- Original model: meta-llama/Llama-3.1-70B-Instruct
- Quantization tool: llm-compressor by vLLM team
- Quantization guide: vLLM FP4 W4A4 Documentation