# VibeVoice-ASR: Selective INT8 Quantization
Selectively quantized version of microsoft/VibeVoice-ASR for low-VRAM deployment.

Only the Qwen2.5-7B LLM backbone is quantized to INT8. The audio tokenizers, connectors, and `lm_head` remain in full BF16 precision, preserving diarization accuracy and transcription quality.
> ⚠️ This model uses the standalone `vibevoice` package (`pip install git+https://github.com/microsoft/VibeVoice.git`), NOT the HF-native variant (`transformers >= 5.3.0`). It requires `transformers == 4.57.3`.
## Key details
| Field | Value |
|---|---|
| Base model | microsoft/VibeVoice-ASR |
| Quantization | INT8 (bitsandbytes `Linear8bitLt`) |
| Modules quantized | `model.language_model.model.layers.*` (196 layers) |
| Modules in BF16 | `acoustic_tokenizer`, `semantic_tokenizer`, `acoustic_connector`, `semantic_connector`, `lm_head` (161 layers) |
| Model size | ~9.2 GB (down from 17.3 GB) |
| Peak VRAM | ~12.5 GB (including inference activations) |
| transformers | `== 4.57.3` |
| bitsandbytes | `>= 0.48.1` |
## Why selective quantization?
Naive INT8 quantization of the entire model produces `[Unintelligible Speech]`: the model detects speech boundaries but cannot decode content. The acoustic and semantic tokenizer encoders process raw audio signals, where quantization errors propagate catastrophically. The LLM backbone (Qwen2.5-7B) handles INT8 quantization gracefully.
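The scale of the error being traded away can be illustrated with a toy absmax INT8 round trip. This is a simplified sketch, not the actual bitsandbytes kernel: it shows that a single quantize/dequantize step introduces a worst-case error of half a quantization step per value, which a large transformer absorbs but which compounds through the many stages of an audio encoder.

```python
def int8_roundtrip(xs):
    """Toy absmax INT8 quantize + dequantize (simplified sketch of
    what per-tensor INT8 quantization does, not the bnb kernel)."""
    scale = max(abs(x) for x in xs) / 127.0
    return [round(x / scale) * scale for x in xs]

signal = [0.001 * i for i in range(-100, 101)]  # toy "waveform"
recon = int8_roundtrip(signal)
max_err = max(abs(a - b) for a, b in zip(signal, recon))
# worst-case error is half a quantization step: max|x| / 254
print(max_err)
```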
**Critical discovery:** the standalone `vibevoice` package uses different module names than the HF-native variant. The correct skip list for the standalone model is:
| Standalone (this model) | HF-native (won't work here) |
|---|---|
| `acoustic_tokenizer` | `acoustic_tokenizer_encoder` |
| `semantic_tokenizer` | `semantic_tokenizer_encoder` |
| `acoustic_connector` | `acoustic_projection` |
| `semantic_connector` | `semantic_projection` |
Using the HF-native names with the standalone package silently quantizes audio-critical modules, producing garbage output.
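Why the failure is silent: the `llm_int8_skip_modules` check is, to a first approximation, a substring match against each layer's dotted module path, so HF-native names simply never match the standalone model's modules and no error is raised. A simplified sketch of that matching logic (the module path below is a hypothetical example, and the substring check is an assumption about the skip behavior, not the exact transformers code):

```python
STANDALONE_SKIP = ["acoustic_tokenizer", "semantic_tokenizer",
                   "acoustic_connector", "semantic_connector", "lm_head"]
HF_NATIVE_SKIP = ["acoustic_tokenizer_encoder", "semantic_tokenizer_encoder",
                  "acoustic_projection", "semantic_projection", "lm_head"]

def would_quantize(module_path, skip_modules):
    # simplified skip check: a Linear layer stays in BF16 only when
    # some skip entry appears in its dotted module path
    return not any(skip in module_path for skip in skip_modules)

audio_layer = "model.acoustic_tokenizer.encoder.layers.0.conv"  # hypothetical path
print(would_quantize(audio_layer, STANDALONE_SKIP))  # False: kept in BF16
print(would_quantize(audio_layer, HF_NATIVE_SKIP))   # True: silently quantized
```

Note that `"acoustic_tokenizer_encoder"` never matches `"acoustic_tokenizer.encoder"` (underscore vs. dot), which is exactly how the audio path ends up quantized with no warning.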
## Usage
```python
import torch

from vibevoice.modular.modeling_vibevoice_asr import VibeVoiceASRForConditionalGeneration
from vibevoice.processor.vibevoice_asr_processor import VibeVoiceASRProcessor

model_id = "Dubedo/VibeVoice-ASR-INT8"

# Load processor (no preprocessor_config.json: the default ratio=3200 is correct)
processor = VibeVoiceASRProcessor.from_pretrained(
    model_id,
    language_model_pretrained_name="Qwen/Qwen2.5-7B",
)

# Load quantized model
model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

# Transcribe
inputs = processor(
    audio=["path/to/audio.wav"],
    sampling_rate=None,
    return_tensors="pt",
    padding=True,
    add_generation_prompt=True,
)
inputs = {k: v.to("cuda") if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=32768,
        pad_token_id=processor.pad_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        do_sample=False,
    )

# Decode only the newly generated tokens
input_length = inputs["input_ids"].shape[1]
generated_ids = output_ids[0, input_length:]
text = processor.decode(generated_ids, skip_special_tokens=True)
segments = processor.post_process_transcription(text)
```
## Quantization method
Quantized on an NVIDIA L4 (22 GB) using the standalone `vibevoice` package with `BitsAndBytesConfig`:
```python
import torch
from transformers import BitsAndBytesConfig

from vibevoice.modular.modeling_vibevoice_asr import VibeVoiceASRForConditionalGeneration

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=[
        "acoustic_tokenizer",
        "semantic_tokenizer",
        "acoustic_connector",
        "semantic_connector",
        "lm_head",
    ],
)

model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    "microsoft/VibeVoice-ASR",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```
## Important notes
- Do NOT create a `preprocessor_config.json`. The standalone processor's default fallback sets `speech_tok_compress_ratio=3200`, which is correct; creating one with `ratio=320` causes a 10x mask shape mismatch and an `IndexError`.
- Requires `bitsandbytes >= 0.48.1`; v0.48.0 has a confirmed critical bug that breaks INT8 quantization.
- INT8 models cannot be moved between CPU and GPU. For VRAM management, delete the model and reload it rather than moving it to CPU.
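The compress-ratio pitfall is simple arithmetic: the processor sizes the speech-token mask in proportion to `num_samples / speech_tok_compress_ratio`, so a ratio of 320 yields a mask ~10x longer than the token count the model actually produces. A back-of-envelope sketch (the `ceil` formula and the 24 kHz sample rate are assumptions for illustration):

```python
import math

def speech_token_count(num_samples, compress_ratio):
    # rough estimate of how many speech tokens the processor expects
    return math.ceil(num_samples / compress_ratio)

num_samples = 24_000 * 10  # hypothetical 10 s clip at 24 kHz
correct = speech_token_count(num_samples, 3200)  # default, matches the model
wrong = speech_token_count(num_samples, 320)     # from a hand-written config
print(correct, wrong)  # 75 750: a 10x mask mismatch, hence the IndexError
```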
## Acknowledgments
Based on microsoft/VibeVoice-ASR. Built for the Dubedo AI video dubbing platform.