VideoRLVR

VideoRLVR is a reinforcement learning (RL) recipe for training video reasoning models with verifiable rewards. This model is an RL-optimized version of Wan2.2-TI2V-5B, presented in the paper "Video Models Can Reason with Verifiable Rewards".

The model is trained with an SDE-GRPO optimization backbone and rule-based feedback to improve visual reasoning on complex, procedurally generated tasks such as Maze, FlowFree, and Sokoban.
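
To make "rule-based feedback" concrete, here is a minimal sketch of a verifier for a Maze-style task. It is an illustration only, not the repository's reward code: the grid encoding ('S' start, 'G' goal, '#' wall, '.' open), the function name verify_maze_path, and the assumption that a trajectory has already been extracted from the generated frames as a list of (row, col) cells are all hypothetical.

```python
from typing import List, Tuple

Cell = Tuple[int, int]


def verify_maze_path(grid: List[str], path: List[Cell]) -> float:
    """Return 1.0 if `path` is a valid walk from 'S' to 'G', else 0.0.

    Hypothetical encoding: 'S' start, 'G' goal, '#' wall, '.' open cell.
    """
    if not path:
        return 0.0
    rows, cols = len(grid), len(grid[0])

    def cell(rc: Cell) -> str:
        r, c = rc
        # Anything outside the grid is treated as a wall.
        if 0 <= r < rows and 0 <= c < cols:
            return grid[r][c]
        return "#"

    # The trajectory must begin on the start cell and end on the goal cell.
    if cell(path[0]) != "S" or cell(path[-1]) != "G":
        return 0.0

    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        # Each move must go to a 4-connected neighbour (no jumps, no diagonals).
        if abs(r0 - r1) + abs(c0 - c1) != 1:
            return 0.0
        # Moves may never enter a wall.
        if cell((r1, c1)) == "#":
            return 0.0
    return 1.0


maze = ["S.#",
        "#.G"]
good_path = [(0, 0), (0, 1), (1, 1), (1, 2)]
bad_path = [(0, 0), (1, 0), (1, 1), (1, 2)]  # steps into a wall
print(verify_maze_path(maze, good_path), verify_maze_path(maze, bad_path))
```

Because the check is a deterministic rule rather than a learned judge, the same trajectory always receives the same score, which is what makes the reward verifiable.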

Paper: Video Models Can Reason with Verifiable Rewards
Project Page: https://darthzhu.github.io/VideoRLVR-page/
Code: https://github.com/luka-group/VideoRLVR

Method Overview

VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories. Key components include:

SDE-GRPO: An optimization backbone for video diffusion models.
Dense Decomposed Rewards: Verifiable, rule-based feedback to guide the model.
Early-Step Focus: A strategy that restricts policy optimization to the early denoising phase, significantly reducing training latency while preserving performance.
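
The three components above can be sketched in a few lines of plain Python. This is a hedged illustration under stated assumptions, not the released training code: the reward component names, the weights, and the train_fraction value are hypothetical, and the advantage formula is the standard GRPO group normalization (reward minus group mean, divided by group standard deviation), which may differ in detail from what the paper uses.

```python
from statistics import mean, pstdev
from typing import Dict, List


def decomposed_reward(components: Dict[str, float],
                      weights: Dict[str, float]) -> float:
    """Combine several rule-based scores into one scalar (weighted sum).

    Component names and weights are hypothetical placeholders.
    """
    return sum(weights[name] * components[name] for name in weights)


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """Standard GRPO normalization over a group of samples for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def early_step_mask(num_steps: int, train_fraction: float) -> List[bool]:
    """Mark which denoising steps contribute to the policy loss.

    Only the first `train_fraction` of steps are optimized; the value of
    `train_fraction` is an assumption made for this sketch.
    """
    cutoff = max(1, int(num_steps * train_fraction))
    return [t < cutoff for t in range(num_steps)]


# Hypothetical group of four generated videos for the same puzzle.
weights = {"valid_moves": 0.5, "reached_goal": 0.5}
group = [
    {"valid_moves": 1.0, "reached_goal": 1.0},
    {"valid_moves": 1.0, "reached_goal": 0.0},
    {"valid_moves": 0.5, "reached_goal": 0.0},
    {"valid_moves": 0.0, "reached_goal": 0.0},
]
rewards = [decomposed_reward(c, weights) for c in group]
advantages = group_relative_advantages(rewards)
mask = early_step_mask(num_steps=10, train_fraction=0.5)
print(rewards, advantages, mask)
```

In this sketch, samples that score above the group mean receive positive advantages and those below receive negative ones, while the mask keeps gradient computation away from late denoising steps, which is where the training-latency saving would come from.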

Citation

@article{zhu2026video,
  title={Video Models Can Reason with Verifiable Rewards},
  author={Tinghui Zhu and Sheng Zhang and James Y. Huang and Selena Song and Xiaofei Wen and Yuankai Li and Hoifung Poon and Muhao Chen},
  journal={arXiv preprint arXiv:2605.15458},
  year={2026}
}