VibeVoice-ASR: Selective INT8 Quantization

Selectively quantized version of microsoft/VibeVoice-ASR for low-VRAM deployment.

Only the Qwen2.5-7B LLM backbone is quantized to INT8. Audio tokenizers, connectors, and lm_head remain in full BF16 precision, preserving diarization accuracy and transcription quality.

⚠️ This model uses the standalone vibevoice package (pip install git+https://github.com/microsoft/VibeVoice.git), NOT the HF-native transformers >= 5.3.0 variant. It requires transformers == 4.57.3.

Key details

| | |
|---|---|
| Base model | microsoft/VibeVoice-ASR |
| Quantization | INT8 (bitsandbytes Linear8bitLt) |
| Modules quantized | `model.language_model.model.layers.*` (196 layers) |
| Modules in BF16 | acoustic_tokenizer, semantic_tokenizer, acoustic_connector, semantic_connector, lm_head (161 layers) |
| Model size | ~9.2 GB (down from 17.3 GB) |
| Peak VRAM | ~12.5 GB (including inference activations) |
| transformers | == 4.57.3 |
| bitsandbytes | >= 0.48.1 |

Why selective quantization?

Naive INT8 quantization of the entire model produces [Unintelligible Speech]: the model detects speech boundaries but cannot decode content. The acoustic and semantic tokenizer encoders process raw audio signals, where quantization errors propagate catastrophically. The LLM backbone (Qwen2.5-7B) handles INT8 quantization gracefully.
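The failure mode is easy to see with a toy absmax INT8 round-trip (an illustrative sketch, not the bitsandbytes kernel): values far below a tensor's maximum get rounded to zero, which is catastrophic for low-amplitude audio features but largely tolerable for LLM weight matrices.

```python
# Toy absmax INT8 round-trip (illustrative sketch, not the bitsandbytes kernel).
def int8_roundtrip(xs):
    scale = max(abs(x) for x in xs) / 127  # absmax scaling to the INT8 range
    quantized = [round(x / scale) for x in xs]
    return [q * scale for q in quantized]

# Small samples next to a large outlier, as in raw audio features
signal = [0.001, 0.5, -0.002, 1.0]
recon = int8_roundtrip(signal)
print(recon[0])  # 0.0 -- the 0.001 sample is rounded away entirely
```

In a deep encoder this rounding error compounds layer after layer, which is consistent with the "detects speech boundaries but cannot decode content" behaviour described above.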

Critical discovery: The standalone vibevoice package uses different module names than the HF-native variant. The correct skip list for the standalone model is:

| Standalone (this model) | HF-native (won't work here) |
|---|---|
| acoustic_tokenizer | acoustic_tokenizer_encoder |
| semantic_tokenizer | semantic_tokenizer_encoder |
| acoustic_connector | acoustic_projection |
| semantic_connector | semantic_projection |

Using the HF-native names with the standalone package silently quantizes audio-critical modules, producing garbage output.
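A skip list can be sanity-checked before quantizing by replicating the name matching that transformers applies for `llm_int8_skip_modules` (a sketch assuming substring matching on the qualified module name; the module path `model.acoustic_tokenizer.encoder.conv1` below is hypothetical):

```python
# Sketch: predict whether a module would be quantized, assuming a module is
# skipped when any skip-list entry appears in its fully qualified name.
SKIP_STANDALONE = ["acoustic_tokenizer", "semantic_tokenizer",
                   "acoustic_connector", "semantic_connector", "lm_head"]
SKIP_HF_NATIVE = ["acoustic_tokenizer_encoder", "semantic_tokenizer_encoder",
                  "acoustic_projection", "semantic_projection", "lm_head"]

def would_quantize(module_name, skip_list):
    return not any(skip in module_name for skip in skip_list)

name = "model.acoustic_tokenizer.encoder.conv1"  # hypothetical audio module path
print(would_quantize(name, SKIP_STANDALONE))  # False -- correctly skipped
print(would_quantize(name, SKIP_HF_NATIVE))   # True -- silently quantized
```

This is exactly the silent failure: the HF-native entries never match the standalone module names, so no error is raised and the audio path gets quantized anyway.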

Usage

```python
import torch
from vibevoice.modular.modeling_vibevoice_asr import VibeVoiceASRForConditionalGeneration
from vibevoice.processor.vibevoice_asr_processor import VibeVoiceASRProcessor

model_id = "Dubedo/VibeVoice-ASR-INT8"

# Load processor (no preprocessor_config.json; the default ratio=3200 is correct)
processor = VibeVoiceASRProcessor.from_pretrained(
    model_id,
    language_model_pretrained_name="Qwen/Qwen2.5-7B",
)

# Load quantized model
model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

# Transcribe
inputs = processor(
    audio=["path/to/audio.wav"],
    sampling_rate=None,
    return_tensors="pt",
    padding=True,
    add_generation_prompt=True,
)
inputs = {k: v.to("cuda") if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=32768,
        pad_token_id=processor.pad_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        do_sample=False,
    )

# Strip the prompt tokens, then decode and split into diarized segments
input_length = inputs["input_ids"].shape[1]
generated_ids = output_ids[0, input_length:]
text = processor.decode(generated_ids, skip_special_tokens=True)
segments = processor.post_process_transcription(text)
```

Quantization method

Quantized on NVIDIA L4 (22GB) using the standalone vibevoice package with BitsAndBytesConfig:

```python
import torch
from transformers import BitsAndBytesConfig
from vibevoice.modular.modeling_vibevoice_asr import VibeVoiceASRForConditionalGeneration

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=[
        "acoustic_tokenizer",
        "semantic_tokenizer",
        "acoustic_connector",
        "semantic_connector",
        "lm_head",
    ],
)

model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    "microsoft/VibeVoice-ASR",
    quantization_config=quantization_config,
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```

Important notes

- Do NOT create a preprocessor_config.json. The standalone processor's default fallback sets speech_tok_compress_ratio=3200, which is correct; creating one with ratio=320 causes a 10x mask-shape mismatch and an IndexError.
- Requires bitsandbytes >= 0.48.1; v0.48.0 has a confirmed critical bug that breaks INT8 quantization.
- INT8 models cannot be moved between CPU and GPU. For VRAM management, delete the model and reload it rather than offloading it.
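The compress-ratio mismatch in the first note is easy to quantify. With a hypothetical 10-second clip at 24 kHz (clip length and sample rate are illustrative, not taken from the model config), the two ratios yield speech-token counts that differ by 10x, so a mask built for one cannot index the other:

```python
# Sketch of the mask-length mismatch behind the IndexError.
samples = 10 * 24_000              # 10 s of audio at an assumed 24 kHz
tokens_correct = samples // 3200   # default compress ratio -> 75 speech tokens
tokens_wrong = samples // 320      # wrong ratio -> 750 speech tokens
print(tokens_correct, tokens_wrong)  # 75 750
assert tokens_wrong == 10 * tokens_correct  # the 10x shape mismatch
```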

Acknowledgments

Based on microsoft/VibeVoice-ASR. Built for the Dubedo AI video dubbing platform.
