VibeVoice-ASR: Selective INT8 Quantization

Selectively quantized version of microsoft/VibeVoice-ASR for low-VRAM deployment.

Only the Qwen2.5-7B LLM backbone is quantized to INT8. Audio tokenizers, connectors, and lm_head remain in full BF16 precision, preserving diarization accuracy and transcription quality.

⚠️ This model uses the standalone vibevoice package (pip install git+https://github.com/microsoft/VibeVoice.git), NOT the HF-native transformers >= 5.3.0 variant. It requires transformers == 4.57.3.

Key details

| | |
|---|---|
| Base model | microsoft/VibeVoice-ASR |
| Quantization | INT8 (bitsandbytes Linear8bitLt) |
| Modules quantized | `model.language_model.model.layers.*` (196 layers) |
| Modules in BF16 | acoustic_tokenizer, semantic_tokenizer, acoustic_connector, semantic_connector, lm_head (161 layers) |
| Model size | ~9.2 GB (down from 17.3 GB) |
| Peak VRAM | ~12.5 GB (including inference activations) |
| transformers | == 4.57.3 |
| bitsandbytes | >= 0.48.1 |

Why selective quantization?

Naive INT8 quantization of the entire model produces [Unintelligible Speech]: the model detects speech boundaries but cannot decode content. The acoustic and semantic tokenizer encoders process raw audio signals, where quantization errors propagate catastrophically. The LLM backbone (Qwen2.5-7B) handles INT8 quantization gracefully.
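The failure mode is easy to see with a toy absmax INT8 round-trip (an illustrative sketch, not the bitsandbytes kernel): values far below a tensor's maximum get rounded to zero, which is catastrophic for low-amplitude audio features but largely tolerable for LLM weight matrices.

```python
# Toy absmax INT8 round-trip (illustrative sketch, not the bitsandbytes kernel).
def int8_roundtrip(xs):
    scale = max(abs(x) for x in xs) / 127  # absmax scaling to the INT8 range
    quantized = [round(x / scale) for x in xs]
    return [q * scale for q in quantized]

# Small samples next to a large outlier, as in raw audio features
signal = [0.001, 0.5, -0.002, 1.0]
recon = int8_roundtrip(signal)
print(recon[0])  # 0.0 -- the 0.001 sample is rounded away entirely
```

In a deep encoder this rounding error compounds layer after layer, which is consistent with the "detects speech boundaries but cannot decode content" behaviour described above.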

Critical discovery: The standalone vibevoice package uses different module names than the HF-native variant. The correct skip list for the standalone model is:

| Standalone (this model) | HF-native (won't work here) |
|---|---|
| acoustic_tokenizer | acoustic_tokenizer_encoder |
| semantic_tokenizer | semantic_tokenizer_encoder |
| acoustic_connector | acoustic_projection |
| semantic_connector | semantic_projection |

Using the HF-native names with the standalone package silently quantizes audio-critical modules, producing garbage output.
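A skip list can be sanity-checked before quantizing by replicating the name matching that transformers applies for `llm_int8_skip_modules` (a sketch assuming substring matching on the qualified module name; the module path `model.acoustic_tokenizer.encoder.conv1` below is hypothetical):

```python
# Sketch: predict whether a module would be quantized, assuming a module is
# skipped when any skip-list entry appears in its fully qualified name.
SKIP_STANDALONE = ["acoustic_tokenizer", "semantic_tokenizer",
                   "acoustic_connector", "semantic_connector", "lm_head"]
SKIP_HF_NATIVE = ["acoustic_tokenizer_encoder", "semantic_tokenizer_encoder",
                  "acoustic_projection", "semantic_projection", "lm_head"]

def would_quantize(module_name, skip_list):
    return not any(skip in module_name for skip in skip_list)

name = "model.acoustic_tokenizer.encoder.conv1"  # hypothetical audio module path
print(would_quantize(name, SKIP_STANDALONE))  # False -- correctly skipped
print(would_quantize(name, SKIP_HF_NATIVE))   # True -- silently quantized
```

This is exactly the silent failure: the HF-native entries never match the standalone module names, so no error is raised and the audio path gets quantized anyway.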

Usage

```python
import torch
from vibevoice.modular.modeling_vibevoice_asr import VibeVoiceASRForConditionalGeneration
from vibevoice.processor.vibevoice_asr_processor import VibeVoiceASRProcessor

model_id = "Dubedo/VibeVoice-ASR-INT8"

# Load processor (no preprocessor_config.json; the default ratio=3200 is correct)
processor = VibeVoiceASRProcessor.from_pretrained(
    model_id,
    language_model_pretrained_name="Qwen/Qwen2.5-7B",
)

# Load quantized model
model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

# Transcribe
inputs = processor(
    audio=["path/to/audio.wav"],
    sampling_rate=None,
    return_tensors="pt",
    padding=True,
    add_generation_prompt=True,
)
inputs = {k: v.to("cuda") if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=32768,
        pad_token_id=processor.pad_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        do_sample=False,
    )

# Strip the prompt tokens, then decode and split into diarized segments
input_length = inputs["input_ids"].shape[1]
generated_ids = output_ids[0, input_length:]
text = processor.decode(generated_ids, skip_special_tokens=True)
segments = processor.post_process_transcription(text)
```

Quantization method

Quantized on NVIDIA L4 (22GB) using the standalone vibevoice package with BitsAndBytesConfig:

```python
import torch
from transformers import BitsAndBytesConfig
from vibevoice.modular.modeling_vibevoice_asr import VibeVoiceASRForConditionalGeneration

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=[
        "acoustic_tokenizer",
        "semantic_tokenizer",
        "acoustic_connector",
        "semantic_connector",
        "lm_head",
    ],
)

model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    "microsoft/VibeVoice-ASR",
    quantization_config=quantization_config,
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```

Important notes

- Do NOT create a preprocessor_config.json. The standalone processor's default fallback sets speech_tok_compress_ratio=3200, which is correct; creating one with ratio=320 causes a 10x mask-shape mismatch and an IndexError.
- Requires bitsandbytes >= 0.48.1; v0.48.0 has a confirmed critical bug that breaks INT8 quantization.
- INT8 models cannot be moved between CPU and GPU. For VRAM management, delete the model and reload it rather than offloading it.
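The compress-ratio mismatch in the first note is easy to quantify. With a hypothetical 10-second clip at 24 kHz (clip length and sample rate are illustrative, not taken from the model config), the two ratios yield speech-token counts that differ by 10x, so a mask built for one cannot index the other:

```python
# Sketch of the mask-length mismatch behind the IndexError.
samples = 10 * 24_000              # 10 s of audio at an assumed 24 kHz
tokens_correct = samples // 3200   # default compress ratio -> 75 speech tokens
tokens_wrong = samples // 320      # wrong ratio -> 750 speech tokens
print(tokens_correct, tokens_wrong)  # 75 750
assert tokens_wrong == 10 * tokens_correct  # the 10x shape mismatch
```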

Acknowledgments

Based on microsoft/VibeVoice-ASR. Built for the Dubedo AI video dubbing platform.
