Llama-3.1-70B-Instruct-NVFP4-W4A4

This is an NVFP4 (4-bit floating point) quantized version of meta-llama/Llama-3.1-70B-Instruct created using llm-compressor.

Note: this model quantizes weights and activations to FP4. The KV cache is NOT quantized and remains in bf16 (original precision).

Quantization Details

  • Quantization Method: NVFP4 W4A4 (weights and activations only; KV cache not quantized)
  • Weight Precision: FP4 (4-bit floating point)
    • Per-tensor global scales + per-group (size 16) local quantization scales (illustrated in the sketch after this list)
  • Activation Precision: FP4 (4-bit floating point)
    • Per-tensor global scales with dynamic per-group local quantization
  • KV Cache: bf16 (not quantized, remains in original precision)
  • Quantization Scheme: NVFP4 (NVIDIA FP4 format)
  • Ignored Layers: lm_head only
  • Calibration Dataset: CNN/DailyMail
  • Calibration Samples: 512
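
For intuition, here is a minimal NumPy sketch of how the two-level scaling described above works for a single group of 16 weights: values are rounded onto the FP4 (E2M1) grid, each group carries a local scale stored relative to a per-tensor global scale, and dequantization multiplies the three back together. The E2M1 code points and the E4M3 range are standard; the rounding and scale selection here are simplified assumptions, not the exact llm-compressor kernels.

import numpy as np

# The eight non-negative magnitudes representable by FP4 E2M1 (sign is a separate bit).
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value, the format used for the local (group) scales

def quantize_group(w, global_scale):
    """Simplified two-level NVFP4-style quantization of one group of 16 weights."""
    # Local scale maps the group's max magnitude onto the FP4 range [0, 6],
    # expressed relative to the per-tensor global scale.
    local_scale = max(np.abs(w).max() / (E2M1_VALUES.max() * global_scale), 1e-12)
    scaled = w / (local_scale * global_scale)                      # roughly in [-6, 6]
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_VALUES).argmin(axis=1)
    return np.sign(scaled) * E2M1_VALUES[idx], local_scale         # FP4 codes + FP8-range scale

def dequantize_group(q, local_scale, global_scale):
    return q * local_scale * global_scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=16).astype(np.float32)             # one group of 16 weights
global_scale = np.abs(w).max() / (E2M1_VALUES.max() * E4M3_MAX)    # keeps local scales within E4M3 range
q, s = quantize_group(w, global_scale)
print("max abs error:", np.abs(w - dequantize_group(q, s, global_scale)).max())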

Model Size

  • Original Model: ~140GB (bf16)
  • Quantized Model: ~40GB (NVFP4 W4A4)
  • Compression Ratio: ~3.5x
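
These figures follow from a back-of-the-envelope calculation: 4 bits per weight plus one FP8 scale per 16-weight group, ignoring the layers kept in higher precision (such as lm_head), which is why the ratio lands near 3.5x rather than the ideal 4x. A rough check (the parameter count is approximate):

params = 70.6e9                      # approximate Llama-3.1-70B parameter count
bf16_gb   = params * 2.0 / 1e9       # 2 bytes/param            -> ~141 GB
fp4_gb    = params * 0.5 / 1e9       # 4 bits/param             -> ~35 GB
scales_gb = (params / 16) * 1 / 1e9  # one FP8 scale per group  -> ~4.4 GB
ratio = bf16_gb / (fp4_gb + scales_gb)
print(f"{bf16_gb:.0f} GB -> {fp4_gb + scales_gb:.0f} GB (~{ratio:.1f}x)")  # roughly the ~3.5x quoted above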

Usage

Installation

pip install "vllm>=0.6.0"

With vLLM

from vllm import LLM, SamplingParams

# Load the NVFP4 W4A4 quantized model.
# vLLM detects the scheme from the checkpoint's quantization_config,
# so the `quantization` argument can normally be left unset
# (or passed explicitly as "compressed-tensors").
llm = LLM(model="JongYeop/Llama-3.1-70B-Instruct-NVFP4-W4A4")

# Generate text
prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
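
For chat-style prompts with the model's chat template applied automatically, recent vLLM versions also expose LLM.chat. A minimal sketch reusing the llm and sampling_params objects from above (check your vLLM version if the method is missing):

# Chat-style generation: the tokenizer's chat template is applied for you
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize NVFP4 quantization in one sentence."},
]
chat_outputs = llm.chat(messages, sampling_params)
print(chat_outputs[0].outputs[0].text)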

With Transformers (for inspection)

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("JongYeop/Llama-3.1-70B-Instruct-NVFP4-W4A4")
model = AutoModelForCausalLM.from_pretrained(
    "JongYeop/Llama-3.1-70B-Instruct-NVFP4-W4A4",
    torch_dtype="auto",  # loading the quantized weights requires the compressed-tensors package
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
]

input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Performance

NVFP4 W4A4 quantization provides:

  • ~3.5x memory reduction compared to bf16
  • Faster inference on FP4-capable hardware (native FP4 tensor cores ship with NVIDIA Blackwell, e.g., B200)
  • Higher compression than FP8 while maintaining good accuracy
  • Efficient group-wise quantization for weights (group size: 16)

Hardware Requirements

  • GPU: NVIDIA GPU with compute capability >= 8.9 (Ada Lovelace, Hopper, Blackwell)
    • Examples: RTX 4090, L40S, H100, H200, B200
    • Native FP4 tensor-core acceleration is available only on Blackwell (e.g., B200); older GPUs typically run the model through dequantization kernels
  • VRAM: Minimum 48GB for inference (e.g., a single H100 80GB); see the KV cache sizing sketch below
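
Because the KV cache stays in bf16, it (rather than the 4-bit weights) becomes the dominant additional memory consumer at long context lengths. A rough sizing sketch, assuming the standard Llama-3.1-70B configuration of 80 layers, 8 KV heads, and head dimension 128:

# bf16 KV cache footprint for Llama-3.1-70B (80 layers, 8 KV heads, head_dim 128)
layers, kv_heads, head_dim, bytes_per_elem = 80, 8, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
print(kv_per_token / 1e6, "MB per token")                         # ~0.33 MB
print(kv_per_token * 32_768 / 1e9, "GB at a 32k context")         # ~10.7 GB on top of ~40 GB of weights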

Why NVFP4 W4A4?

NVFP4 (NVIDIA FP4) is a 4-bit floating-point format that provides:

  1. Better dynamic range than INT4 quantization (illustrated in the sketch after this list)
  2. Higher compression than FP8 (4-bit vs 8-bit)
  3. Group-wise quantization for weights (group size 16) to preserve accuracy
  4. Per-tensor global scales with dynamic per-group quantization for activations
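
Point 1 is easiest to see by writing out the representable magnitudes: FP4 E2M1 spaces its code points non-uniformly, resolving values near zero much more finely than a uniform INT4 grid scaled to the same maximum. A simple illustration using the standard E2M1 value set:

# FP4 E2M1 magnitudes vs. a uniform signed INT4 grid scaled to the same maximum (6.0)
fp4_levels  = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
int4_levels = [round(6.0 * k / 7, 2) for k in range(8)]
print("FP4 E2M1:", fp4_levels)      # fine steps near zero, coarse steps near the max
print("INT4    :", int4_levels)     # constant ~0.86 steps everywhere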

KV Cache remains in bf16 because:

  • Preserves generation quality for long contexts
  • Reduces potential accuracy degradation in multi-turn conversations
  • Allows easier comparison with other quantization methods

Quantization Recipe

The quantization recipe used for this model is included in the repository as recipe.yaml.

Key configuration:

quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: ["lm_head"]
      scheme: "NVFP4"      # NVIDIA FP4 format
      targets: ["Linear"]

Comparison with Other Formats

Format           Model Size  Compression  KV Cache  Notes
BF16 (Original)  ~140GB      1.0x         bf16      Full precision
FP8 W8A8         ~70GB       2.0x         bf16      Good balance
NVFP4 W4A4       ~40GB       3.5x         bf16      Higher compression
FP8 W8A8+KV      ~65GB       2.2x         fp8       Full quantization

Citation

If you use this model, please cite:

@software{llm-compressor,
  title = {LLM Compressor},
  author = {vLLM Team},
  url = {https://github.com/vllm-project/llm-compressor},
  year = {2024}
}

@article{llama3,
  title={Llama 3 Model Card},
  author={AI@Meta},
  year={2024},
  url={https://github.com/meta-llama/llama3}
}

License

This model inherits the Llama 3.1 Community License from the original meta-llama/Llama-3.1-70B-Instruct model.
