Miso TTS 8B BF16 for ComfyUI
BF16 conversion of Miso TTS 8B prepared specifically for use with the MisoTTS-ComfyUI custom node:
https://github.com/Saganaki22/MisoTTS-ComfyUI
This repository contains converted BF16 weights only. No architectural changes, finetuning, retraining, or modifications to the original model behavior have been made.
Model Introduction
Miso TTS 8B is a text-to-speech model based on the Sesame CSM architecture. It generates Mimi audio codes from text and optional audio context using a large Llama-style backbone and an autoregressive audio decoder.
This BF16 release is intended for ComfyUI users who want reduced memory usage while maintaining output quality comparable to the original release.
Quickstart
- Install ComfyUI.
- Install the MisoTTS-ComfyUI custom node: https://github.com/Saganaki22/MisoTTS-ComfyUI
- Place this BF16 checkpoint in your MisoTTS model directory.
- Load the model using the MisoTTS-ComfyUI loader node.
- Generate speech using a text-only or reference-audio workflow.
Model Summary
| Item | Value |
|---|---|
| Model | Miso TTS 8B |
| Variant | BF16 Conversion |
| Intended Platform | ComfyUI |
| Custom Node | MisoTTS-ComfyUI |
| Task | Text-to-Speech |
| Architecture | Sesame-style CSM |
| Backbone | llama-8B |
| Audio Decoder | llama-300M |
| Audio Tokenizer | Mimi |
| Text Vocabulary | 128,256 |
| Audio Vocabulary | 2,051 |
| Audio Codebooks | 32 |
| Max Sequence Length | 2,048 |
| Precision | BF16 |
| Format | Safetensors |
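To confirm that a downloaded checkpoint matches the Precision and Format rows above, the safetensors header can be read without loading any weights. The following is a minimal sketch using only the Python standard library. It relies on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header); the function names are illustrative and are not part of MisoTTS-ComfyUI.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header size as a little-endian uint64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def dtype_counts(header):
    """Count tensors per dtype, skipping the optional __metadata__ entry."""
    return Counter(
        entry["dtype"]
        for name, entry in header.items()
        if name != "__metadata__"
    )


# Usage (the file name is a placeholder, not the actual checkpoint name):
#   print(dtype_counts(read_safetensors_header("model.safetensors")))
```

For a pure BF16 conversion, the counter would be expected to report `BF16` for the weight tensors.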
Architecture
Miso TTS 8B uses two transformer components:
- A large backbone transformer that consumes text and audio-frame embeddings.
- A smaller autoregressive decoder transformer that predicts higher-order audio codebooks.
Codebook 0 is predicted directly from the backbone hidden state, while codebooks 1 through 31 are generated autoregressively by the decoder.
This release preserves the original architecture and only changes weight precision.
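The two-stage prediction described above can be illustrated with a toy control-flow sketch. It only shows the order in which the codes of one audio frame are produced; the functions `backbone_hidden_state`, `codebook0_head`, and `decoder_step` are hypothetical stand-ins that return random values, not the real model or the node's API.

```python
import random

NUM_CODEBOOKS = 32   # "Audio Codebooks" in the model summary
AUDIO_VOCAB = 2051   # "Audio Vocabulary" in the model summary


def backbone_hidden_state(context, rng):
    """Hypothetical stand-in for the backbone: returns a fake hidden state."""
    return [rng.random() for _ in range(4)]


def codebook0_head(hidden, rng):
    """Hypothetical stand-in: codebook 0 comes straight from the backbone state."""
    return rng.randrange(AUDIO_VOCAB)


def decoder_step(hidden, previous_codes, rng):
    """Hypothetical stand-in: the decoder sees the codes generated so far."""
    return rng.randrange(AUDIO_VOCAB)


def generate_frame(context, seed=0):
    """Produce one frame of NUM_CODEBOOKS codes in the order described above."""
    rng = random.Random(seed)
    hidden = backbone_hidden_state(context, rng)
    codes = [codebook0_head(hidden, rng)]        # codebook 0
    for _ in range(1, NUM_CODEBOOKS):            # codebooks 1 through 31
        codes.append(decoder_step(hidden, codes, rng))
    return codes
```

Each call to `generate_frame` returns a list of 32 codes, one per codebook, for a single audio frame.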
BF16 Conversion Notes
- Converted from the original Miso TTS weights.
- No retraining performed.
- No finetuning performed.
- No integer or low-bit quantization applied; weights are cast to BF16 only.
- BF16 stores each weight in 2 bytes, so memory usage is roughly half that of an FP32 checkpoint.
- Output quality should remain effectively identical to the original model aside from minor numerical differences inherent to BF16 inference.
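To make the memory and precision trade-off in the notes above concrete, here is a small standard-library sketch. It rounds a float32 value to BF16 (finite inputs only) and estimates checkpoint size from a parameter count. The figure of 8.3 billion parameters is an assumption derived from the nominal backbone and decoder labels, not a measured count.

```python
import struct


def to_bf16(x):
    """Round a finite float to the nearest BF16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # BF16 keeps the top 16 bits of the float32 bit pattern.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def checkpoint_gib(num_params, bytes_per_param):
    """Approximate weight storage in GiB, ignoring file metadata."""
    return num_params * bytes_per_param / 2**30


ASSUMED_PARAMS = 8.3e9  # assumption: nominal 8B backbone + 300M decoder

if __name__ == "__main__":
    print(f"0.1 in BF16: {to_bf16(0.1)}")
    print(f"FP32: {checkpoint_gib(ASSUMED_PARAMS, 4):.1f} GiB")
    print(f"BF16: {checkpoint_gib(ASSUMED_PARAMS, 2):.1f} GiB")
```

BF16 keeps the float32 exponent range but only 8 bits of significand precision, which is why the differences from the original weights are small relative errors rather than changes in scale.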
Intended Use
This model is intended for:
- Text-to-speech generation
- Conversational speech synthesis
- Voice continuation workflows
- Reference-audio conditioned speech generation
- ComfyUI audio generation pipelines
Limitations
- Voice similarity from reference audio is not guaranteed.
- Long generations may require workflow chunking.
- Output quality remains dependent on prompting and generation settings.
- Hardware with native BF16 support is recommended for best performance.
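For the chunking mentioned above, one simple approach is to split long input text at sentence boundaries and generate each chunk separately. The sketch below is a generic helper, not part of MisoTTS-ComfyUI, and the default `max_chars=300` is an arbitrary assumption that would need tuning against the model's sequence limit.

```python
import re


def chunk_text(text, max_chars=300):
    """Group sentences into chunks of at most max_chars where possible.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow the chunk, so start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be passed through the workflow in turn and the resulting audio segments concatenated.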
Attribution
Original model:
- MisoLabs / Miso TTS 8B
ComfyUI integration:
- MisoTTS-ComfyUI: https://github.com/Saganaki22/MisoTTS-ComfyUI
All credit for the original architecture, training, datasets, and research belongs to the original Miso Labs team.
License
This BF16 conversion inherits the licensing and usage restrictions of the original Miso TTS release. Please review the upstream license before use.

