Combined Model Suggestion

by yukiarimo - opened 6 days ago

Thanks for the model! Looks great! Hope Japanese transcribes correctly!

So, here's my suggestion for the next model: I'd like to see Qwen 3 VL 4B combined with an audio encoder like this one (NOT Whisper)!

Also, how do you actually train the audio encoder? I've worked with audio tokenizer autoencoders (aka neural audio codecs), where the loss is based on reconstruction and the model can both encode and decode. But this one can only encode! So how do you train it, and how do you know it's working properly if you can't decode and listen to the audio the model is actually hearing? :)
