VideoRLVR

VideoRLVR is a reinforcement learning (RL) recipe for training video reasoning models with verifiable rewards, introduced in the paper "Video Models Can Reason with Verifiable Rewards".

This checkpoint is a supervised fine-tuned (SFT) version of Wan2.2-TI2V-5B, trained on procedurally generated reasoning tasks including Maze, FlowFree, and Sokoban.

Paper: Video Models Can Reason with Verifiable Rewards
Project Page: https://darthzhu.github.io/VideoRLVR-page/
Repository: https://github.com/luka-group/VideoRLVR

Overview

VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories. It combines an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy for efficient training. This lets video diffusion models satisfy explicit spatial, temporal, or logical constraints, moving beyond perceptual imitation toward reliable, rule-consistent visual reasoning.
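To make the reward side of this concrete, here is a minimal sketch of how decomposed rewards might feed a GRPO-style update. It assumes the standard group-relative advantage (each sample's reward standardized against the other samples drawn for the same prompt). The component names, weights, and helper functions are hypothetical illustrations, not taken from the VideoRLVR repository, and the SDE sampling and Early-Step Focus parts of the recipe are not shown.

```python
from statistics import mean, pstdev

# Hypothetical component weights; the actual reward decomposition may differ.
WEIGHTS = {"valid_moves": 0.5, "reached_goal": 1.0}


def combine_rewards(components, weights):
    """Weighted sum of named reward components for one generated video."""
    return sum(weights[name] * value for name, value in components.items())


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampling group (GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for the same maze prompt.
group = [
    {"valid_moves": 1.0, "reached_goal": 1.0},
    {"valid_moves": 1.0, "reached_goal": 0.0},
    {"valid_moves": 0.5, "reached_goal": 0.0},
    {"valid_moves": 0.0, "reached_goal": 0.0},
]
rewards = [combine_rewards(c, WEIGHTS) for c in group]
advantages = group_relative_advantages(rewards)
```

Because the partial-credit component separates rollouts that all miss the goal, the group still yields non-zero advantages even when no sample succeeds outright, which is the usual motivation for dense rewards over a single pass/fail signal.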

Across tasks like Maze, FlowFree, and Sokoban, VideoRLVR consistently improves over supervised fine-tuning baselines, demonstrating that verifiable RL can effectively optimize models for objective success criteria.
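As an illustration of what an objective success criterion can look like, the sketch below checks a Maze trajectory against the rules. It assumes the agent's grid positions have already been extracted from the generated frames (that extraction step is not shown), and the function name, grid encoding (0 = free, 1 = wall), and return values are hypothetical rather than the project's actual verifier.

```python
def verify_maze_path(grid, path, start, goal):
    """Return (success, fraction_of_valid_steps) for a decoded trajectory.

    grid: list of rows, 0 = free cell, 1 = wall (assumed encoding).
    path: list of (row, col) positions, one per decoded step.
    """
    if not path or path[0] != start:
        return False, 0.0
    rows, cols = len(grid), len(grid[0])
    steps = len(path) - 1
    valid = 0
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        in_bounds = 0 <= r1 < rows and 0 <= c1 < cols
        adjacent = abs(r1 - r0) + abs(c1 - c0) == 1  # one cell, no diagonals
        if not (in_bounds and adjacent and grid[r1][c1] == 0):
            break  # stop counting at the first rule violation
        valid += 1
    success = valid == steps and path[-1] == goal
    fraction = valid / steps if steps else 1.0
    return success, fraction


grid = [
    [0, 0, 1],
    [1, 0, 1],
    [1, 0, 0],
]
good = [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2)]
bad = [(0, 0), (0, 1), (0, 2)]  # second move walks into a wall

good_result = verify_maze_path(grid, good, (0, 0), (2, 2))
bad_result = verify_maze_path(grid, bad, (0, 0), (2, 2))
```

The boolean is the kind of pass/fail outcome a success rate can be computed from, while the fraction of valid steps is one possible source of partial credit.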

Citation

@article{zhu2026video,
  title={Video Models Can Reason with Verifiable Rewards}, 
  author={Tinghui Zhu and Sheng Zhang and James Y. Huang and Selena Song and Xiaofei Wen and Yuankai Li and Hoifung Poon and Muhao Chen},
  journal={arXiv preprint arXiv:2605.15458},
  year={2026}
}