VibeVoice-ASR-HF: Selective INT8 Quantization

Selectively quantized version of microsoft/VibeVoice-ASR-HF for low-VRAM deployment.

Only the Qwen2.5-7B LLM backbone is quantized. The acoustic tokenizer encoder, semantic tokenizer encoder, projection layers, and lm_head remain in full BF16 precision, preserving diarization accuracy and transcription quality.

Key details

| Field | Value |
| --- | --- |
| Base model | microsoft/VibeVoice-ASR-HF |
| Quantization | INT8 (bitsandbytes) |
| Modules quantized | `language_model.model.layers.*` only |
| Modules kept in BF16 | `acoustic_tokenizer_encoder`, `semantic_tokenizer_encoder`, `acoustic_projection`, `semantic_projection`, `lm_head` |
| Model size | ~9 GB (down from 17.3 GB) |
| VRAM usage | ~10–11 GB |
| Transformers | >= 5.3.0 |
| bitsandbytes | >= 0.48.1 |
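The ~9 GB figure in the table is consistent with a quick back-of-envelope estimate. The 7B/1B parameter split below is an assumption for illustration, not an exact accounting of this checkpoint:

```python
# Rough disk-size estimate for the selective quantization described above.
# Assumed split: ~7B backbone params (INT8, 1 byte each) plus ~1B params
# across the BF16 components (2 bytes each). Numbers are illustrative.
llm_params = 7.0e9      # Qwen2.5-7B backbone, quantized to INT8
bf16_params = 1.0e9     # audio encoders, projections, lm_head kept in BF16
size_gb = (llm_params * 1 + bf16_params * 2) / 1e9
print(f"~{size_gb:.0f} GB")  # consistent with the ~9 GB in the table
```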

Why selective quantization?

Naive 8-bit quantization of the entire model destroys diarization (all speakers collapse to SPEAKER_00) and degrades transcription quality significantly. The acoustic and semantic tokenizer encoders process raw audio signals where small numerical errors propagate catastrophically through the convolutional stages. The LLM backbone (Qwen2.5-7B) handles quantization gracefully since its weights follow a normal distribution well-suited for INT8.
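The selection itself comes down to name matching against a skip list. A minimal sketch of that filtering logic (the matching rule and module paths are illustrative, modeled on this model's architecture rather than taken from the bitsandbytes source):

```python
# Sketch: decide which Linear layers get quantized, by checking the module
# path against the protected components from this model's skip list.
SKIP_MODULES = [
    "acoustic_tokenizer_encoder",
    "semantic_tokenizer_encoder",
    "acoustic_projection",
    "semantic_projection",
    "lm_head",
]

def should_quantize(module_name: str) -> bool:
    """Return True if no protected component appears in the module path."""
    return not any(skip in module_name for skip in SKIP_MODULES)

names = [
    "language_model.model.layers.0.self_attn.q_proj",  # LLM backbone -> INT8
    "acoustic_tokenizer_encoder.conv_blocks.0",        # audio path   -> BF16
    "lm_head",                                         # output head  -> BF16
]
for name in names:
    print(name, "-> INT8" if should_quantize(name) else "-> BF16")
```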

Usage

```python
import torch
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

model_id = "Dubedo/VibeVoice-ASR-HF-INT8"

processor = AutoProcessor.from_pretrained(model_id)
model = VibeVoiceAsrForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Build inputs from an audio file; the prompt can carry optional hotwords
inputs = processor.apply_transcription_request(
    audio="path/to/audio.wav",
    prompt="optional hotwords here",
).to(model.device, model.dtype)

output_ids = model.generate(**inputs)
# Strip the prompt tokens, keeping only the newly generated ones
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]

# Structured output with speaker labels, timestamps, and text
result = processor.decode(generated_ids, return_format="parsed")[0]
```

Quantization method

Quantized using BitsAndBytesConfig with llm_int8_skip_modules to protect audio-critical components:

```python
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=[
        "acoustic_tokenizer_encoder",
        "semantic_tokenizer_encoder",
        "acoustic_projection",
        "semantic_projection",
        "lm_head",
    ],
)
```

Acknowledgments

Based on the selective quantization approach documented by FabioSarracino/VibeVoice-Large-Q8 and Enemyx-net/VibeVoice-ComfyUI, adapted for the HF-native ASR architecture in transformers 5.3.0.
