🚀 Introducing Robust-U1: Teaching MLLMs to Self-Recover Corrupted Visual Content
Multimodal Large Language Models (MLLMs) have achieved impressive visual understanding, yet they remain highly brittle under real-world corruptions—noise, blur, compression artifacts, adverse weather.
Standard MLLMs suffer dramatic performance drops, and existing robustness solutions come with fundamental limits: black‑box feature alignment lacks interpretability, while white‑box text reasoning cannot restore the lost pixel‑level visual details. This raises a crucial question:
🧐 Can MLLMs recover corrupted visual content by themselves?
If the answer is yes, we can move beyond merely “compensating” for corruption and instead build a more intrinsic, generalizable form of resilience. Robust-U1 is our answer to that question.
🛰️ Introducing Awesome-Remote-Sensing-Agents: The Largest Curated Collection of Intelligent Remote Sensing Agents
We are excited to share our new repository Awesome-Remote-Sensing-Agents – a comprehensive, community-driven collection of 100+ papers at the intersection of remote sensing and intelligent agents (LLMs, VLM, multi‑agent systems, etc.).
🤝 Join the Community! We warmly welcome contributions to keep this list up‑to‑date: 📝 Add missing papers via Pull Request 🏷️ Propose new or refined categories 🔗 Report broken links or outdated entries 💬 Discuss via GitHub Issues or contact the authors
We have open-sourced Robust-R1 (AAAI 2026 Oral), a new paradigm in the field of anti-degradation and robustness enhancement for multimodal large models.
Multimodal Large Language Models struggle to maintain reliable performance under extreme real-world visual degradations, which impede their practical robustness. Existing robust MLLMs predominantly rely on implicit training/adaptation that focuses solely on visual encoder generalization, suffering from limited interpretability and isolated optimization. To overcome these limitations, we propose Robust-R1, a novel framework that explicitly models visual degradations through structured reasoning chains. Our approach integrates: (i) supervised fine-tuning for degradation-aware reasoning foundations, (ii) reward-driven alignment for accurately perceiving degradation parameters, and (iii) dynamic reasoning depth scaling adapted to degradation intensity. To facilitate this approach, we introduce a specialized 11K dataset featuring realistic degradations synthesized across four critical real-world visual processing stages, each annotated with structured chains connecting degradation parameters, perceptual influence, pristine semantic reasoning chain, and conclusion. Comprehensive evaluations demonstrate state-of-the-art robustness: Robust-R1 outperforms all general and robust baselines on the real-world degradation benchmark R-Bench, while maintaining superior anti-degradation performance under multi-intensity adversarial degradations on MMMB, MMStar, and RealWorldQA.
We have open-sourced Hawk (NeurIPS 2024) 🎉, one of the pioneering frameworks for open-world video anomaly understanding.
In the field of video anomaly detection, despite continuous technological advancements, existing systems still face limitations in semantic understanding of scenes and user interaction, making it challenging to effectively identify complex anomalous scenes. Additionally, the scarcity of datasets restricts the applicability of these systems in open-world scenarios.
To tackle these challenges, we developed Hawk, an open-world video understanding and anomaly detection framework. Hawk significantly enhances anomaly recognition by identifying motion information differences between anomalous and normal videos. We introduce an auxiliary consistency loss to strengthen the focus on motion modalities and establish a supervisory relationship between motion and language representations. Furthermore, we have annotated over 8,000 anomalous videos and their language descriptions and created 8,000 question-answer pairs to support effective training in diverse open-world scenarios.
Experimental results demonstrate that Hawk surpasses existing video understanding frameworks in video description generation and question-answering tasks.