Transformers version error
vllm serve Qwen/Qwen3-ASR-1.7B --served-model-name qwen-asr
Value error, The checkpoint you are trying to load has model type qwen3_asr but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
(APIServer pid=165904)
(APIServer pid=165904) You can update Transformers with the command pip install --upgrade transformers. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command pip install git+https://github.com/huggingface/transformers.git [type=value_error, input_value=ArgsKwargs((), {'model': ...rocessor_plugin': None}), input_type=ArgsKwargs]
(APIServer pid=165904) For further information visit https://errors.pydantic.dev/2.12/v/value_error
Local environment: transformers 5.0.0
pip install transformers==4.57.6
I installed transformers==4.57.6, but I still get an error: Transformers still cannot recognize qwen3_asr. If anyone knows how to solve this, I would appreciate your help. Thank you in advance.
I'm not sure why you're still getting this error. The last line of the config file at https://huggingface.co/Qwen/Qwen3-ASR-1.7B/blob/main/config.json shows the transformers version the checkpoint was saved with, and my installed version works perfectly.
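For example, you can compare your installed version with the one recorded in the checkpoint using a short script like this (a minimal sketch; transformers_version is the field that Transformers normally writes into config.json):
import json
import transformers
from huggingface_hub import hf_hub_download

# Download only the config file and read the version it was saved with
config_path = hf_hub_download("Qwen/Qwen3-ASR-1.7B", "config.json")
with open(config_path) as f:
    config = json.load(f)

print("installed transformers:", transformers.__version__)
print("checkpoint transformers_version:", config.get("transformers_version"))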
If you follow this code from the official Qwen GitHub page:
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "Qwen/Qwen3-ASR-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    # attn_implementation="flash_attention_2",
    max_inference_batch_size=32,  # Batch size limit for inference. -1 means unlimited. Smaller values can help avoid OOM.
    max_new_tokens=256,  # Maximum number of tokens to generate. Set a larger value for long audio input.
)

results = model.transcribe(
    audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
    language=None,  # set "English" to force the language
)

print(results[0].language)
print(results[0].text)
You must import qwen_asr from their GitHub repo.
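For installation, the commands mentioned later in this thread should work (a sketch based on that report; check the official repo for the exact requirements, and note that the [vllm] extra appears to be what enables the vLLM-backed serving used below):
pip install -U qwen-asr
pip install -U qwen-asr[vllm]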
Yes, I also used the qwen_asr package to run it. Transformers on its own does not support this model type (qwen3_asr).
I would like to ask for your advice. How did you solve this problem? Did you install the qwen-asr Python package directly with pip install -U qwen-asr in a conda environment? I noticed that vLLM can accelerate model inference and handle parallel processing, so I need to use vllm serve. If it's convenient for you, I would greatly appreciate your reply.
I use qwen-asr-serve just like vllm serve; I guess it is just a wrapper around vLLM.
qwen-asr-serve Qwen/Qwen3-ASR-1.7B --gpu-memory-utilization 0.8 --host 0.0.0.0 --port 8000
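Once the server is up, you can send requests with an OpenAI-style client. A minimal client sketch, assuming qwen-asr-serve exposes vLLM's OpenAI-compatible /v1/audio/transcriptions endpoint and that you have a local asr_en.wav file (both are assumptions, not confirmed in this thread):
from openai import OpenAI

# Point the client at the server started above; the api_key is a placeholder
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("asr_en.wav", "rb") as audio_file:  # assumed local audio file
    transcription = client.audio.transcriptions.create(
        model="Qwen/Qwen3-ASR-1.7B",  # should match the served model name
        file=audio_file,
    )

print(transcription.text)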
Thank you very much for your reply. I reviewed the installation requirements on GitHub and, following the instructions, installed qwen-asr with pip install -U qwen-asr[vllm]. I was then able to run the code below successfully:
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    # "Qwen/Qwen3-ASR-1.7B",
    "/root/autodl-tmp/models/asr/1.7B/Qwen/Qwen3-ASR-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    # attn_implementation="flash_attention_2",
    max_inference_batch_size=32,  # Batch size limit for inference. -1 means unlimited. Smaller values can help avoid OOM.
    max_new_tokens=256,  # Maximum number of tokens to generate. Set a larger value for long audio input.
)

results = model.transcribe(
    audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
    language=None,  # set "English" to force the language
)

print(results[0].language)
print(results[0].text)
However, when I use qwen-asr-serve "/root/autodl-tmp/models/asr/1.7B/Qwen/Qwen3-ASR-1.7B" --gpu-memory-utilization 0.8 --host 0.0.0.0 --port 8000 in the terminal, it reports an error. The specific error is:
(APIServer pid=8470) INFO 02-02 16:59:38 [api_server.py:1272] vLLM API server version 0.14.0
(APIServer pid=8470) INFO 02-02 16:59:38 [utils.py:263] non-default args: {'model_tag': '/root/autodl-tmp/models/asr/1.7B/Qwen/Qwen3-ASR-1.7B', 'host': '0.0.0.0', 'model': '/root/autodl-tmp/models/asr/1.7B/Qwen/Qwen3-ASR-1.7B', 'gpu_memory_utilization': 0.8}
(APIServer pid=8470) INFO 02-02 16:59:38 [model.py:530] Resolved architecture: Qwen3ASRForConditionalGeneration
(APIServer pid=8470) ERROR 02-02 16:59:38 [repo_utils.py:65] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/root/autodl-tmp/models/asr/1.7B/Qwen/Qwen3-ASR-1.7B'. Use `repo_type` argument if needed., retrying 1 of 2
(APIServer pid=8470) ERROR 02-02 16:59:40 [repo_utils.py:63] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/root/autodl-tmp/models/asr/1.7B/Qwen/Qwen3-ASR-1.7B'. Use `repo_type` argument if needed.
(APIServer pid=8470) INFO 02-02 16:59:40 [model.py:1866] Downcasting torch.float32 to torch.bfloat16.
(APIServer pid=8470) INFO 02-02 16:59:40 [model.py:1545] Using max model len 65536
(APIServer pid=8470) INFO 02-02 16:59:40 [scheduler.py:229] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=8470) INFO 02-02 16:59:40 [vllm.py:630] Asynchronous scheduling is enabled.
(APIServer pid=8470) INFO 02-02 16:59:40 [vllm.py:637] Disabling NCCL for DP synchronization when using async scheduling.
(APIServer pid=8470) The tokenizer you are loading from '/root/autodl-tmp/models/asr/1.7B/Qwen/Qwen3-ASR-1.7B' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(APIServer pid=8470) The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
(EngineCore_DP0 pid=8560) INFO 02-02 16:59:55 [gpu_model_runner.py:3808] Starting to load model /root/autodl-tmp/models/asr/1.7B/Qwen/Qwen3-ASR-1.7B...
(EngineCore_DP0 pid=8560) INFO 02-02 16:59:56 [mm_encoder_attention.py:86] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore_DP0 pid=8560) INFO 02-02 16:59:56 [vllm.py:630] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=8560) INFO 02-02 16:59:56 [cuda.py:351] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 1.56it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 2.79it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 2.49it/s]
(EngineCore_DP0 pid=8560)
(EngineCore_DP0 pid=8560) INFO 02-02 16:59:57 [default_loader.py:291] Loading weights took 0.81 seconds
(EngineCore_DP0 pid=8560) INFO 02-02 16:59:58 [gpu_model_runner.py:3905] Model loading took 3.87 GiB memory and 1.086352 seconds
(EngineCore_DP0 pid=8560) INFO 02-02 16:59:59 [gpu_model_runner.py:4715] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 5 audio items of the maximum feature size.
(EngineCore_DP0 pid=8560) ERROR 02-02 16:59:59 [core.py:936] EngineCore failed to start.
(EngineCore_DP0 pid=8560) ERROR 02-02 16:59:59 [core.py:936] Traceback (most recent call last):
(EngineCore_DP0 pid=8560) ERROR 02-02 16:59:59 [core.py:936] File "/root/miniconda3/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 927, in run_engine_core
(EngineCore_DP0 pid=8560) Process EngineCore_DP0:
(EngineCore_DP0 pid=8560) Traceback (most recent call last):
(EngineCore_DP0 pid=8560) File "/root/miniconda3/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=8560) self.run()
(EngineCore_DP0 pid=8560) File "/root/miniconda3/lib/python3.12/multiprocessing/process.py", line 108, in run
(APIServer pid=8470) File "/root/miniconda3/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 980, in wait_for_engine_startup
(APIServer pid=8470) raise RuntimeError(
(APIServer pid=8470) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
I have researched extensively and consulted GPT, but I am still unable to resolve this issue, which is causing me great distress. If anyone knows how to solve it, I would greatly appreciate your assistance.

