See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding • Paper • 2605.18018 • Published 11 days ago • 32