
VideoRLVR

VideoRLVR is a reinforcement learning (RL) recipe for training video reasoning models with verifiable rewards, introduced in the paper "Video Models Can Reason with Verifiable Rewards".

This checkpoint is a supervised fine-tuned (SFT) version of Wan2.2-TI2V-5B, trained on procedurally generated reasoning tasks including Maze, FlowFree, and Sokoban.

Overview

VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories. It uses an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy for efficient training. This approach enables video diffusion models to satisfy explicit spatial, temporal, or logical constraints, moving beyond perceptual imitation toward reliable, rule-consistent visual reasoning.
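To make the group-relative part of GRPO concrete, here is a minimal sketch of how decomposed reward components might be combined and then normalized within a group of sampled videos. It is not the paper's implementation: the component names (`validity`, `progress`), the weights, and both function names are hypothetical, and neither the SDE sampling nor the Early-Step Focus strategy is modeled.

```python
from statistics import mean, pstdev


def combine_rewards(components, weights):
    """Weighted sum of named reward components (names are hypothetical)."""
    return sum(weights[name] * value for name, value in components.items())


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization: center and scale rewards within one group.

    `rewards` holds one scalar per sample generated from the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative group of four samples for one prompt (made-up values).
weights = {"validity": 0.5, "progress": 1.0}
group = [
    {"validity": 1.0, "progress": 0.5},
    {"validity": 0.0, "progress": 0.0},
    {"validity": 1.0, "progress": 0.0},
    {"validity": 0.0, "progress": 0.5},
]
scalar_rewards = [combine_rewards(sample, weights) for sample in group]
advantages = group_relative_advantages(scalar_rewards)
```

Samples that score above the group mean receive positive advantages and the rest receive negative ones, so the update signal depends only on relative quality within the group.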

Across tasks like Maze, FlowFree, and Sokoban, VideoRLVR consistently improves over supervised fine-tuning baselines, demonstrating that verifiable RL can effectively optimize models for objective success criteria.
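As an illustration of what an objective success criterion can look like, the sketch below checks a Maze trajectory against the rules. It assumes the agent's path has already been extracted from the generated video as a list of grid cells; that extraction step, the grid encoding (0 for free, 1 for wall), and the function name `is_valid_maze_path` are assumptions for this example, not details taken from the paper.

```python
def is_valid_maze_path(grid, path, start, goal):
    """Return True if `path` is a rule-consistent solution of the maze.

    grid: list of rows, 0 = free cell, 1 = wall (assumed encoding).
    path: list of (row, col) cells visited in order.
    """
    # The trajectory must begin at the start cell and end at the goal.
    if not path or path[0] != start or path[-1] != goal:
        return False

    rows, cols = len(grid), len(grid[0])
    # Every visited cell must be inside the grid and not a wall.
    for r, c in path:
        if not (0 <= r < rows and 0 <= c < cols) or grid[r][c] == 1:
            return False

    # Consecutive cells must be 4-neighbors (no jumps or diagonal moves).
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        if abs(r0 - r1) + abs(c0 - c1) != 1:
            return False
    return True


# Small made-up maze for illustration.
MAZE = [
    [0, 0, 1],
    [1, 0, 1],
    [1, 0, 0],
]
GOOD_PATH = [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2)]
solved = is_valid_maze_path(MAZE, GOOD_PATH, start=(0, 0), goal=(2, 2))
```

Because the check is deterministic, the same function can serve as an evaluation metric and as a binary reward signal.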

Citation

@article{zhu2026video,
  title={Video Models Can Reason with Verifiable Rewards}, 
  author={Tinghui Zhu and Sheng Zhang and James Y. Huang and Selena Song and Xiaofei Wen and Yuankai Li and Hoifung Poon and Muhao Chen},
  journal={arXiv preprint arXiv:2605.15458},
  year={2026}
}