Combined Model Suggestion
#1 opened by yukiarimo
Thanks for the model! Looks great! Hope Japanese transcribes correctly!
So, here's my suggestion for the next model: I'd like to see Qwen 3 VL 4B + an audio encoder like this one (NOT Whisper)!
Also, how do you actually train the audio encoder? I've worked with audio tokenizer autoencoders (aka neural audio codecs), where you just base the loss on reconstruction, and the model can both encode and decode. But this one can only encode! So, how do you actually train it, and how do you know it's working fine if you can't decode and listen to the audio the model is actually hearing? :)
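For context, here is roughly what I mean by reconstruction-based training. This is only a toy sketch in NumPy, not a real neural audio codec and not a guess at how this model's encoder was trained: the linear encoder/decoder, the function names (`make_toy_frames`, `train_linear_autoencoder`), and all hyperparameters are made up for illustration.

```python
import numpy as np

def make_toy_frames(n_frames=64, frame_len=16, seed=0):
    """Hypothetical stand-in for audio frames: sine waves with random phase."""
    rng = np.random.default_rng(seed)
    t = np.arange(frame_len) / frame_len
    freqs = rng.integers(1, 3, size=(n_frames, 1))        # 1 or 2 cycles per frame
    phases = rng.uniform(0, 2 * np.pi, size=(n_frames, 1))
    return np.sin(2 * np.pi * freqs * t + phases)

def train_linear_autoencoder(frames, latent_dim=4, lr=0.1, steps=500, seed=0):
    """Train encoder and decoder jointly on mean squared reconstruction error."""
    rng = np.random.default_rng(seed)
    _, frame_len = frames.shape
    W_enc = rng.normal(scale=0.1, size=(frame_len, latent_dim))
    W_dec = rng.normal(scale=0.1, size=(latent_dim, frame_len))
    losses = []
    for _ in range(steps):
        z = frames @ W_enc          # encode: frame -> latent
        recon = z @ W_dec           # decode: latent -> reconstructed frame
        err = recon - frames
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the MSE loss through decoder and encoder by hand.
        grad_recon = 2.0 * err / err.size
        grad_dec = z.T @ grad_recon
        grad_enc = frames.T @ (grad_recon @ W_dec.T)
        W_enc -= lr * grad_enc
        W_dec -= lr * grad_dec
    return W_enc, W_dec, losses

if __name__ == "__main__":
    frames = make_toy_frames()
    _, _, losses = train_linear_autoencoder(frames)
    print(f"reconstruction loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The point is that the decoder gives you both the training signal and a way to check the result: you can decode the latents and listen. With an encode-only model neither of those is available, which is what my question is about.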