FireRedASR2S
A SOTA Industrial-Grade All-in-One ASR System

[Code] [Paper] [Model🤗] [Model🤖] [Demo]

FireRedASR2S is a state-of-the-art (SOTA), industrial-grade, all-in-one ASR system with ASR, VAD, LID, and Punc modules. All modules achieve SOTA performance:

FireRedASR2: Automatic Speech Recognition (ASR) supporting peech and singing transcription for Chinese (Mandarin, 20+ dialects/accents), English, code-switching. 2.89% average CER on 4 public Mandarin benchmarks, 11.55% on 19 Chinese dialects and accents benchmarks, outperforming Doubao-ASR, Qwen3-ASR-1.7B, Fun-ASR, and Fun-ASR-Nano-2512. FireRedASR2-AED also supports word-level timestamps and confidence scores.
FireRedVAD: Voice Activity Detection (VAD) supporting speech/singing/music in 100+ languages. 97.57% F1, outperforming Silero-VAD, TEN-VAD, FunASR-VAD and WebRTC-VAD. Supports non-streaming/streaming VAD and Multi-label VAD (mVAD).
FireRedLID: Spoken Language Identification (LID) supporting 100+ languages and 20+ Chinese dialects/accents. 97.18% accuracy, outperforming Whisper and SpeechBrain.
FireRedPunc: Punctuation Prediction (Punc) for Chinese and English. 78.90% average F1, outperforming FunASR-Punc (62.77%).

2S: 2nd-generation FireRedASR, now expanded to an all-in-one ASR System

🔥 News

[2026.03.12] 🔥 We release FireRedASR2S technical report. See arXiv.
[2026.03.05] 🚀 vLLM supports FireRedASR2-LLM. See vLLM Usage part.
[2026.02.25] 🔥 We release FireRedASR2-LLM model weights. 🤗 🤖
[2026.02.13] 🚀 Support TensorRT-LLM inference acceleration for FireRedASR2-AED (contributed by NVIDIA). Benchmark on AISHELL-1 test set shows 12.7x speedup over PyTorch baseline (single H20).
[2026.02.12] 🔥 We release FireRedASR2S (FireRedASR2-AED, FireRedVAD, FireRedLID, and FireRedPunc) with model weights and inference code. Download links below. Technical report and finetuning code coming soon.

Available Models and Languages

Model	Supported Languages & Dialects	Download
FireRedASR2-LLM	Chinese (Mandarin and 20+ dialects/accents^*), English, Code-Switching	🤗 \| 🤖
FireRedASR2-AED	Chinese (Mandarin and 20+ dialects/accents^*), English, Code-Switching	🤗 \| 🤖
FireRedVAD	100+ languages, 20+ Chinese dialects/accents^*	🤗 \| 🤖
FireRedLID	100+ languages, 20+ Chinese dialects/accents^*	🤗 \| 🤖
FireRedPunc	Chinese, English	🤗 \| 🤖

^*Supported Chinese dialects/accents: Cantonese (Hong Kong & Guangdong), Sichuan, Shanghai, Wu, Minnan, Anhui, Fujian, Gansu, Guizhou, Hebei, Henan, Hubei, Hunan, Jiangxi, Liaoning, Ningxia, Shaanxi, Shanxi, Shandong, Tianjin, Yunnan, etc.

Method

FireRedASR2S: System Overview

FireRedASR2

FireRedASR2 builds upon FireRedASR with improved accuracy, designed to meet diverse requirements in superior performance and optimal efficiency across various applications. It comprises two variants:

FireRedASR2-LLM: Designed to achieve state-of-the-art performance and to enable seamless end-to-end speech interaction. It adopts an Encoder-Adapter-LLM framework leveraging large language model (LLM) capabilities.
FireRedASR2-AED: Designed to balance high performance and computational efficiency and to serve as an effective speech representation module in LLM-based speech models. It utilizes an Attention-based Encoder-Decoder (AED) architecture.

Other Modules

FireRedVAD: DFSMN-based non-streaming/streaming Voice Activity Detection and Multi-label VAD (mVAD). mVAD can be viewed as a lightweight Audio Event Detection (AED) system specialized for a small set of sound categories (speech/singing/music).
FireRedLID: Encoder-Decoder-based Spoken Language Identification. See FireRedLID README for language details.
FireRedPunc: BERT-based Punctuation Prediction.

Quick Start

Setup

Create a clean Python environment:

$ conda create --name fireredasr2s python=3.10
$ conda activate fireredasr2s
$ git clone https://github.com/FireRedTeam/FireRedASR2S.git
$ cd FireRedASR2S  # or fireredasr2s

Install dependencies and set up PATH and PYTHONPATH:

$ pip install -r requirements.txt
$ export PATH=$PWD/fireredasr2s/:$PATH
$ export PYTHONPATH=$PWD/:$PYTHONPATH

Download models:

# Download via ModelScope (recommended for users in China)
pip install -U modelscope
modelscope download --model xukaituo/FireRedASR2-AED --local_dir ./pretrained_models/FireRedASR2-AED
modelscope download --model xukaituo/FireRedVAD --local_dir ./pretrained_models/FireRedVAD
modelscope download --model xukaituo/FireRedLID --local_dir ./pretrained_models/FireRedLID
modelscope download --model xukaituo/FireRedPunc --local_dir ./pretrained_models/FireRedPunc
modelscope download --model xukaituo/FireRedASR2-LLM --local_dir ./pretrained_models/FireRedASR2-LLM

# Download via Hugging Face
pip install -U "huggingface_hub[cli]"
huggingface-cli download FireRedTeam/FireRedASR2-AED --local-dir ./pretrained_models/FireRedASR2-AED
huggingface-cli download FireRedTeam/FireRedVAD --local-dir ./pretrained_models/FireRedVAD
huggingface-cli download FireRedTeam/FireRedLID --local-dir ./pretrained_models/FireRedLID
huggingface-cli download FireRedTeam/FireRedPunc --local-dir ./pretrained_models/FireRedPunc
huggingface-cli download FireRedTeam/FireRedASR2-LLM --local-dir ./pretrained_models/FireRedASR2-LLM

Convert your audio to 16kHz 16-bit mono PCM format if needed:

$ ffmpeg -i <input_audio_path> -ar 16000 -ac 1 -acodec pcm_s16le -f wav <output_wav_path>

Script Usage

$ cd examples_infer/asr_system
$ bash inference_asr_system.sh

Command-line Usage

$ fireredasr2s-cli --help
$ fireredasr2s-cli --wav_paths "assets/hello_zh.wav" "assets/hello_en.wav" --outdir output
$ cat output/result.jsonl 
# {"uttid": "hello_zh", "text": "你好世界。", "sentences": [{"start_ms": 310, "end_ms": 1840, "text": "你好世界。", "asr_confidence": 0.875, "lang": "zh mandarin", "lang_confidence": 0.999}], "vad_segments_ms": [[310, 1840]], "dur_s": 2.32, "words": [{"start_ms": 490, "end_ms": 690, "text": "你"}, {"start_ms": 690, "end_ms": 1090, "text": "好"}, {"start_ms": 1090, "end_ms": 1330, "text": "世"}, {"start_ms": 1330, "end_ms": 1795, "text": "界"}], "wav_path": "assets/hello_zh.wav"}
# {"uttid": "hello_en", "text": "Hello speech.", "sentences": [{"start_ms": 120, "end_ms": 1840, "text": "Hello speech.", "asr_confidence": 0.833, "lang": "en", "lang_confidence": 0.998}], "vad_segments_ms": [[120, 1840]], "dur_s": 2.24, "words": [{"start_ms": 340, "end_ms": 1020, "text": "hello"}, {"start_ms": 1020, "end_ms": 1666, "text": "speech"}], "wav_path": "assets/hello_en.wav"}

Python API Usage

from fireredasr2s import FireRedAsr2System, FireRedAsr2SystemConfig

asr_system_config = FireRedAsr2SystemConfig()  # Use default config
asr_system = FireRedAsr2System(asr_system_config)

result = asr_system.process("assets/hello_zh.wav")
print(result)
# {'uttid': 'tmpid', 'text': '你好世界。', 'sentences': [{'start_ms': 440, 'end_ms': 1820, 'text': '你好世界。', 'asr_confidence': 0.868, 'lang': 'zh mandarin', 'lang_confidence': 0.999}], 'vad_segments_ms': [(440, 1820)], 'dur_s': 2.32, 'words': [], 'wav_path': 'assets/hello_zh.wav'}

result = asr_system.process("assets/hello_en.wav")
print(result)
# {'uttid': 'tmpid', 'text': 'Hello speech.', 'sentences': [{'start_ms': 260, 'end_ms': 1820, 'text': 'Hello speech.', 'asr_confidence': 0.933, 'lang': 'en', 'lang_confidence': 0.993}], 'vad_segments_ms': [(260, 1820)], 'dur_s': 2.24, 'words': [], 'wav_path': 'assets/hello_en.wav'}

Usage of Each Module

The four components under fireredasr2s, i.e. fireredasr2, fireredvad, fireredlid, and fireredpunc are self-contained and designed to work as a standalone modules. You can use any of them independently without depending on the others. FireRedVAD and FireRedLID will also be open-sourced as standalone libraries in separate repositories.

Script Usage

# ASR
$ cd examples_infer/asr
$ bash inference_asr_aed.sh
$ bash inference_asr_llm.sh

# VAD & mVAD (mVAD=Audio Event Detection, AED)
$ cd examples_infer/vad
$ bash inference_vad.sh
$ bash inference_streamvad.sh
$ bash inference_aed.sh

# LID
$ cd examples_infer/lid
$ bash inference_lid.sh

# Punc
$ cd examples_infer/punc
$ bash inference_punc.sh

vLLM Usage

# Serving FireRedASR2-LLM with latest vLLM for the highest performance.
# For more details, see https://github.com/vllm-project/vllm/pull/35727.
$ vllm serve allendou/FireRedASR2-LLM-vllm -tp=2 --dtype=float32
$ python3 examples/online_serving/openai_transcription_client.py --repetition_penalty=1.0 --audio_path=/root/hello_zh.wav

Python API Usage

Set up PYTHONPATH first: export PYTHONPATH=$PWD/:$PYTHONPATH

ASR

from fireredasr2s.fireredasr2 import FireRedAsr2, FireRedAsr2Config

batch_uttid = ["hello_zh", "hello_en"]
batch_wav_path = ["assets/hello_zh.wav", "assets/hello_en.wav"]

# FireRedASR2-AED
asr_config = FireRedAsr2Config(
    use_gpu=True,
    use_half=False,
    beam_size=3,
    nbest=1,
    decode_max_len=0,
    softmax_smoothing=1.25,
    aed_length_penalty=0.6,
    eos_penalty=1.0,
    return_timestamp=True
)
model = FireRedAsr2.from_pretrained("aed", "pretrained_models/FireRedASR2-AED", asr_config)
results = model.transcribe(batch_uttid, batch_wav_path)
print(results)
# [{'uttid': 'hello_zh', 'text': '你好世界', 'confidence': 0.971, 'dur_s': 2.32, 'rtf': '0.0870', 'wav': 'assets/hello_zh.wav', 'timestamp': [('你', 0.42, 0.66), ('好', 0.66, 1.1), ('世', 1.1, 1.34), ('界', 1.34, 2.039)]}, {'uttid': 'hello_en', 'text': 'hello speech', 'confidence': 0.943, 'dur_s': 2.24, 'rtf': '0.0870', 'wav': 'assets/hello_en.wav', 'timestamp': [('hello', 0.34, 0.98), ('speech', 0.98, 1.766)]}]

# FireRedASR2-LLM
asr_config = FireRedAsr2Config(
    use_gpu=True,
    decode_min_len=0,
    repetition_penalty=1.0,
    llm_length_penalty=0.0,
    temperature=1.0
)
model = FireRedAsr2.from_pretrained("llm", "pretrained_models/FireRedASR2-LLM", asr_config)
results = model.transcribe(batch_uttid, batch_wav_path)
print(results)
# [{'uttid': 'hello_zh', 'text': '你好世界', 'rtf': '0.0681', 'wav': 'assets/hello_zh.wav'}, {'uttid': 'hello_en', 'text': 'hello speech', 'rtf': '0.0681', 'wav': 'assets/hello_en.wav'}]

VAD

from fireredasr2s.fireredvad import FireRedVad, FireRedVadConfig

vad_config = FireRedVadConfig(
    use_gpu=False,
    smooth_window_size=5,
    speech_threshold=0.4,
    min_speech_frame=20,
    max_speech_frame=2000,
    min_silence_frame=20,
    merge_silence_frame=0,
    extend_speech_frame=0,
    chunk_max_frame=30000)
vad = FireRedVad.from_pretrained("pretrained_models/FireRedVAD/VAD", vad_config)

result, probs = vad.detect("assets/hello_zh.wav")

print(result)
# {'dur': 2.32, 'timestamps': [(0.44, 1.82)], 'wav_path': 'assets/hello_zh.wav'}

Stream VAD

Click to expand

from fireredasr2s.fireredvad import FireRedStreamVad, FireRedStreamVadConfig

vad_config=FireRedStreamVadConfig(
    use_gpu=False,
    smooth_window_size=5,
    speech_threshold=0.4,
    pad_start_frame=5,
    min_speech_frame=8,
    max_speech_frame=2000,
    min_silence_frame=20,
    chunk_max_frame=30000)
stream_vad = FireRedStreamVad.from_pretrained("pretrained_models/FireRedVAD/Stream-VAD", vad_config)

frame_results, result = stream_vad.detect_full("assets/hello_zh.wav")

print(result)
# {'dur': 2.32, 'timestamps': [(0.46, 1.84)], 'wav_path': 'assets/hello_zh.wav'}

mVAD (Audio Event Detection, AED)

Click to expand

from fireredasr2s.fireredvad import FireRedAed, FireRedAedConfig

aed_config=FireRedAedConfig(
    use_gpu=False,
    smooth_window_size=5,
    speech_threshold=0.4,
    singing_threshold=0.5,
    music_threshold=0.5,
    min_event_frame=20,
    max_event_frame=2000,
    min_silence_frame=20,
    merge_silence_frame=0,
    extend_speech_frame=0,
    chunk_max_frame=30000)
aed = FireRedAed.from_pretrained("pretrained_models/FireRedVAD/AED", aed_config)

result, probs = aed.detect("assets/event.wav")

print(result)
# {'dur': 22.016, 'event2timestamps': {'speech': [(0.4, 3.56), (3.66, 9.08), (9.27, 9.77), (10.78, 21.76)], 'singing': [(1.79, 19.96), (19.97, 22.016)], 'music': [(0.09, 12.32), (12.33, 22.016)]}, 'event2ratio': {'speech': 0.848, 'singing': 0.905, 'music': 0.991}, 'wav_path': 'assets/event.wav'}

LID

Click to expand

from fireredasr2s.fireredlid import FireRedLid, FireRedLidConfig

batch_uttid = ["hello_zh", "hello_en"]
batch_wav_path = ["assets/hello_zh.wav", "assets/hello_en.wav"]

config = FireRedLidConfig(use_gpu=True, use_half=False)
model = FireRedLid.from_pretrained("pretrained_models/FireRedLID", config)

results = model.process(batch_uttid, batch_wav_path)
print(results)
# [{'uttid': 'hello_zh', 'lang': 'zh mandarin', 'confidence': 0.996, 'dur_s': 2.32, 'rtf': '0.0741', 'wav': 'assets/hello_zh.wav'}, {'uttid': 'hello_en', 'lang': 'en', 'confidence': 0.996, 'dur_s': 2.24, 'rtf': '0.0741', 'wav': 'assets/hello_en.wav'}]

Punc

Click to expand

from fireredasr2s.fireredpunc.punc import FireRedPunc, FireRedPuncConfig

config = FireRedPuncConfig(use_gpu=True)
model = FireRedPunc.from_pretrained("pretrained_models/FireRedPunc", config)

batch_text = ["你好世界", "Hello world"]
results = model.process(batch_text)

print(results)
# [{'punc_text': '你好世界。', 'origin_text': '你好世界'}, {'punc_text': 'Hello world!', 'origin_text': 'Hello world'}]

ASR System

from fireredasr2s.fireredasr2 import FireRedAsr2Config
from fireredasr2s.fireredlid import FireRedLidConfig
from fireredasr2s.fireredpunc import FireRedPuncConfig
from fireredasr2s.fireredvad import FireRedVadConfig
from fireredasr2s import FireRedAsr2System, FireRedAsr2SystemConfig

vad_config = FireRedVadConfig(
    use_gpu=False,
    smooth_window_size=5,
    speech_threshold=0.4,
    min_speech_frame=20,
    max_speech_frame=2000,
    min_silence_frame=20,
    merge_silence_frame=0,
    extend_speech_frame=0,
    chunk_max_frame=30000
)
lid_config = FireRedLidConfig(use_gpu=True, use_half=False)
asr_config = FireRedAsr2Config(
    use_gpu=True,
    use_half=False,
    beam_size=3,
    nbest=1,
    decode_max_len=0,
    softmax_smoothing=1.25,
    aed_length_penalty=0.6,
    eos_penalty=1.0,
    return_timestamp=True
)
punc_config = FireRedPuncConfig(use_gpu=True)

asr_system_config = FireRedAsr2SystemConfig(
    "pretrained_models/FireRedVAD/VAD",
    "pretrained_models/FireRedLID",
    "aed", "pretrained_models/FireRedASR2-AED",
    "pretrained_models/FireRedPunc",
    vad_config, lid_config, asr_config, punc_config,
    enable_vad=1, enable_lid=1, enable_punc=1
)
asr_system = FireRedAsr2System(asr_system_config)

batch_uttid = ["hello_zh", "hello_en"]
batch_wav_path = ["assets/hello_zh.wav", "assets/hello_en.wav"]
for wav_path, uttid in zip(batch_wav_path, batch_uttid):
    result = asr_system.process(wav_path, uttid)
    print(result)
# {'uttid': 'hello_zh', 'text': '你好世界。', 'sentences': [{'start_ms': 440, 'end_ms': 1820, 'text': '你好世界。', 'asr_confidence': 0.868, 'lang': 'zh mandarin', 'lang_confidence': 0.999}], 'vad_segments_ms': [(440, 1820)], 'dur_s': 2.32, 'words': [{'start_ms': 540, 'end_ms': 700, 'text': '你'}, {'start_ms': 700, 'end_ms': 1100, 'text': '好'}, {'start_ms': 1100, 'end_ms': 1300, 'text': '世'}, {'start_ms': 1300, 'end_ms': 1765, 'text': '界'}], 'wav_path': 'assets/hello_zh.wav'}
# {'uttid': 'hello_en', 'text': 'Hello speech.', 'sentences': [{'start_ms': 260, 'end_ms': 1820, 'text': 'Hello speech.', 'asr_confidence': 0.933, 'lang': 'en', 'lang_confidence': 0.993}], 'vad_segments_ms': [(260, 1820)], 'dur_s': 2.24, 'words': [{'start_ms': 400, 'end_ms': 960, 'text': 'hello'}, {'start_ms': 960, 'end_ms': 1666, 'text': 'speech'}], 'wav_path': 'assets/hello_en.wav'}

Note: FireRedASR2S code has only been tested on Linux Ubuntu 22.04. Behavior on other Linux distributions or Windows has not been tested.

FAQ

Q: What audio format is supported?

16kHz 16-bit mono PCM wav. Use ffmpeg to convert other formats: ffmpeg -i <input_audio_path> -ar 16000 -ac 1 -acodec pcm_s16le -f wav <output_wav_path>

Q: What are the input length limitations of ASR models?

FireRedASR2-AED supports audio input up to 60s. Input longer than 60s may cause hallucination issues, and input exceeding 200s will trigger positional encoding errors.
FireRedASR2-LLM supports audio input up to 40s. The behavior for longer input is untested.
FireRedASR2-LLM Batch Beam Search: When performing batch beam search with FireRedASR2-LLM, even though attention masks are applied, it is recommended to ensure that the input lengths of the utterances are similar. If there are significant differences in utterance lengths, shorter utterances may experience repetition issues. You can either sort your dataset by length or set batch_size to 1 to avoid the repetition issue.

Evaluation

FireRedASR2

Metrics: Character Error Rate (CER%) for Chinese and Word Error Rate (WER%) for English. Lower is better.

We evaluate FireRedASR2 on 24 public test sets covering Mandarin, 20+ Chinese dialects/accents, and singing.

Mandarin (4 test sets): 2.89% (LLM) / 3.05% (AED) average CER, outperforming Doubao-ASR (3.69%), Qwen3-ASR-1.7B (3.76%), Fun-ASR (4.16%) and Fun-ASR-Nano-2512 (4.55%).
Dialects (19 test sets): 11.55% (LLM) / 11.67% (AED) average CER, outperforming Doubao-ASR (15.39%), Qwen3-ASR-1.7B (11.85%), Fun-ASR (12.76%) and Fun-ASR-Nano-2512 (15.07%).

Note: FRASR2=FireRedASR2, ws=WenetSpeech, md=MagicData, conv=Conversational, daily=Daily-use.

ID	Testset\CER\Model	FRASR2-LLM	FRASR2-AED	Doubao-ASR	Qwen3-ASR	Fun-ASR
Avg	All(1-24)	9.67	9.80	12.98	10.12	10.92
Avg	Mandarin(1-4)	2.89	3.05	3.69	3.76	4.16
Avg	Dialect(5-23)	11.55	11.67	15.39	11.85	12.76
1	aishell1	0.64	0.57	1.52	1.48	1.64
2	aishell2	2.15	2.51	2.77	2.71	2.38
3	ws-net	4.44	4.57	5.73	4.97	6.85
4	ws-meeting	4.32	4.53	4.74	5.88	5.78
5	kespeech	3.08	3.60	5.38	5.10	5.36
6	ws-yue-short	5.14	5.15	10.51	5.82	7.34
7	ws-yue-long	8.71	8.54	11.39	8.85	10.14
8	ws-chuan-easy	10.90	10.60	11.33	11.99	12.46
9	ws-chuan-hard	20.71	21.35	20.77	21.63	22.49
10	md-heavy	7.42	7.43	7.69	8.02	9.13
11	md-yue-conv	12.23	11.66	26.25	9.76	33.71
12	md-yue-daily	3.61	3.35	12.82	3.66	2.69
13	md-yue-vehicle	4.50	4.83	8.66	4.28	6.00
14	md-chuan-conv	13.18	13.07	11.77	14.35	14.01
15	md-chuan-daily	4.90	5.17	3.90	4.93	3.98
16	md-shanghai-conv	28.70	27.02	45.15	29.77	25.49
17	md-shanghai-daily	24.94	24.18	44.06	23.93	12.55
18	md-wu	7.15	7.14	7.70	7.57	10.63
19	md-zhengzhou-conv	10.20	10.65	9.83	9.55	10.85
20	md-zhengzhou-daily	5.80	6.26	5.77	5.88	6.29
21	md-wuhan	9.60	10.81	9.94	10.22	4.34
22	md-tianjin	15.45	15.30	15.79	16.16	19.27
23	md-changsha	23.18	25.64	23.76	23.70	25.66
24	opencpop	1.12	1.17	4.36	2.57	3.05

FireRedVAD

Click to expand

We evaluate FireRedVAD on FLEURS-VAD-102, a multilingual VAD benchmark covering 102 languages.

FireRedVAD achieves SOTA performance, outperforming Silero-VAD, TEN-VAD, FunASR-VAD, and WebRTC-VAD.

Metric\Model	FireRedVAD	Silero-VAD	TEN-VAD	FunASR-VAD	WebRTC-VAD
AUC-ROC↑	99.60	97.99	97.81	-	-
F1 score↑	97.57	95.95	95.19	90.91	52.30
False Alarm Rate↓	2.69	9.41	15.47	44.03	2.83
Miss Rate↓	3.62	3.95	2.95	0.42	64.15

FLEURS-VAD-102: We randomly selected ~100 audio files per language from FLEURS test set, resulting in 9,443 audio files with manually annotated binary VAD labels (speech=1, silence=0). This VAD testset will be open sourced (coming soon).

Note: FunASR-VAD achieves low Miss Rate but at the cost of high False Alarm Rate (44.03%), indicating over-prediction of speech segments.

FireRedLID

Click to expand

Metric: Utterance-level LID Accuracy (%). Higher is better.

We evaluate FireRedLID on multilingual and Chinese dialect benchmarks.

FireRedLID achieves SOTA performance, outperforming Whisper, SpeechBrain-LID, and Dolphin.

Testset\Model	Languages	FireRedLID	Whisper	SpeechBrain	Dolphin
FLEURS test	82 languages	97.18	79.41	92.91	-
CommonVoice test	74 languages	92.07	80.81	78.75	-
KeSpeech + MagicData	20+ Chinese dialects/accents	88.47	-	-	69.01

FireRedPunc

Click to expand

Metric: Precision/Recall/F1 Score (%). Higher is better.

We evaluate FireRedPunc on multi-domain Chinese and English benchmarks.

FireRedPunc achieves SOTA performance, outperforming FunASR-Punc (CT-Transformer).

Testset\Model	#Sentences	FireRedPunc	FunASR-Punc
Multi-domain Chinese	88,644	82.84 / 83.08 / 82.96	77.27 / 74.03 / 75.62
Multi-domain English	28,641	78.40 / 71.57 / 74.83	55.79 / 45.15 / 49.91
Average F1 Score	-	78.90	62.77

Acknowledgements

Thanks to the following open-source works:

Citation

@article{xu2026fireredasr2s,
  title={FireRedASR2S: A State-of-the-Art Industrial-Grade All-in-One Automatic Speech Recognition System},
  author={Xu, Kaituo and Jia, Yan and Huang, Kai and Chen, Junjie and Li, Wenpeng and Liu, Kun and Xie, Feng-Long and Tang, Xu and Hu, Yao},
  journal={arXiv preprint arXiv:2603.10420},
  year={2026}
}

Downloads last month: 147

Model tree for FireRedTeam/FireRedPunc

Quantizations

1 model

Collection including FireRedTeam/FireRedPunc

FireRedASR2S

Collection

FireRedASR2S is a SOTA, industrial-grade, all-in-one ASR system with ASR, VAD, LID, and Punc module. All modules achieve SOTA performance. • 7 items • Updated Mar 13 • 10

Paper for FireRedTeam/FireRedPunc

FireRedASR2S: A State-of-the-Art Industrial-Grade All-in-One Automatic Speech Recognition System

Paper • 2603.10420 • Published Mar 11 • 7

FireRedASR2S A SOTA Industrial-Grade All-in-One ASR System

🔥 News

Available Models and Languages

Method

FireRedASR2S: System Overview

FireRedASR2

Other Modules

Quick Start

Setup

Script Usage

Command-line Usage

Python API Usage

Usage of Each Module

Script Usage

vLLM Usage

Python API Usage

ASR

VAD

Stream VAD

mVAD (Audio Event Detection, AED)

LID

Punc

ASR System

FAQ

Evaluation

FireRedASR2

FireRedVAD

FireRedLID

FireRedPunc

Acknowledgements

Citation

Model tree for FireRedTeam/FireRedPunc

Collection including FireRedTeam/FireRedPunc

Paper for FireRedTeam/FireRedPunc

FireRedASR2S
A SOTA Industrial-Grade All-in-One ASR System