arXiv:2605.00623

Recovering Hidden Reward in Diffusion-Based Policies

Published on May 1
Submitted by Yanbiao Ji on May 8
Abstract

EnergyFlow unifies generative action modeling with inverse reinforcement learning by parameterizing an energy function whose gradient serves as a denoising field, enabling reward extraction without adversarial training while improving policy generalization through structural constraints.

AI-generated summary

This paper introduces EnergyFlow, a framework that unifies generative action modeling with inverse reinforcement learning by parameterizing a scalar energy function whose gradient is the denoising field. We establish that under maximum-entropy optimality, the score function learned via denoising score matching recovers the gradient of the expert's soft Q-function, enabling reward extraction without adversarial training. Formally, we prove that constraining the learned field to be conservative reduces hypothesis complexity and tightens out-of-distribution generalization bounds. We further characterize the identifiability of recovered rewards and bound how score estimation errors propagate to action preferences. Empirically, EnergyFlow achieves state-of-the-art imitation performance on various manipulation tasks while providing an effective reward signal for downstream reinforcement learning that outperforms both adversarial IRL methods and likelihood-based alternatives. These results show that the structural constraints required for valid reward extraction simultaneously serve as beneficial inductive biases for policy generalization. The code is available at https://github.com/sotaagi/EnergyFlow.
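The paper's code lives at the repository above, but as a quick orientation, here is a minimal PyTorch sketch of the central construction as the abstract describes it: a scalar energy network whose action-gradient serves as the denoising field (conservative by construction, since it is the gradient of a scalar), trained with denoising score matching. All names, shapes, and architecture choices below (EnergyNet, denoising_field, dsm_loss) are my own assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Scalar energy E_theta(s, a_t, t); its action-gradient is the denoising field."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),  # scalar output -> gradient field is conservative
        )

    def forward(self, s, a, t):
        return self.net(torch.cat([s, a, t], dim=-1)).squeeze(-1)

def denoising_field(energy, s, a, t):
    """Score estimate parameterized as the negative gradient of the scalar energy."""
    a = a.detach().requires_grad_(True)
    e = energy(s, a, t).sum()
    return -torch.autograd.grad(e, a, create_graph=True)[0]

def dsm_loss(energy, s, a0, sigma):
    """Denoising score matching: the field should point from a noisy action back
    toward the clean expert action (score of the Gaussian perturbation kernel)."""
    noise = torch.randn_like(a0)
    a_t = a0 + sigma * noise
    t = torch.full((a0.shape[0], 1), sigma, device=a0.device)
    score = denoising_field(energy, s, a_t, t)
    target = -noise / sigma  # grad of log N(a_t; a0, sigma^2 I) w.r.t. a_t
    return ((score - target) ** 2).mean()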

Community

Paper author · Paper submitter

EnergyFlow unifies diffusion-based imitation learning and inverse reinforcement learning by learning a conservative energy field whose gradient drives action generation while exposing a recoverable reward signal, improving manipulation performance, downstream RL, and out-of-distribution robustness.
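One hedged read of how that "recoverable reward signal" could be consumed by downstream RL, reusing the hypothetical EnergyNet from the sketch above: evaluate the learned energy at a near-zero noise level and negate it. This is my guess at a read-out, not the paper's documented recipe; under max-entropy optimality it would track the soft Q-function only up to a state-dependent offset, so it ranks actions within a state rather than across states.

def reward_proxy(energy, s, a, t_min=1e-3):
    """Hypothetical reward read-out: negative energy at (near-)zero noise.
    Meaningful for comparing actions at the same state; the state-dependent
    offset discussed below is not resolved by this."""
    t = torch.full((a.shape[0], 1), t_min, device=a.device)
    with torch.no_grad():
        return -energy(s, a, t)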

Curious how robust the reward recovery is when the max-entropy assumption is only approximately satisfied in real expert data. The core move, using the energy gradient as the denoising field and tying it to the soft Q-gradient under a conservative energy, feels like a nice bridge between diffusion modeling and reward learning. I'm especially curious about the identifiability caveat: the paper notes a state-dependent offset that prevents global recovery. Would a small learned centering term or a structured prior preserve the conservative guarantee while still yielding a globally recoverable reward? The arxivlens breakdown helped me parse the energy-to-gradient mapping and the 1D temporal U-Net choice; quick refresher here if others want the same: https://arxivlens.com/PaperView/Details/recovering-hidden-reward-in-diffusion-based-policies-6944-7dae7923. It would also be neat to see how this scales to truly high-dimensional action spaces or more severe distribution shifts, to test whether the integrability bias consistently improves out-of-sample generalization without trading off peak performance.
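For anyone who wants the state-dependent offset in the comment above made explicit, here is the standard maximum-entropy bookkeeping (my notation; the paper's may differ):

\pi^*(a \mid s) \propto \exp\!\big(Q_{\mathrm{soft}}(s,a)/\alpha\big)
\quad\Longrightarrow\quad
\nabla_a \log \pi^*(a \mid s) = \tfrac{1}{\alpha}\,\nabla_a Q_{\mathrm{soft}}(s,a).

The per-state normalizer (the soft value V_soft(s)) is constant in a and drops out of the gradient, so any \widetilde{Q}(s,a) = Q_{\mathrm{soft}}(s,a) + c(s) induces exactly the same score field. Integrating the score therefore pins down action preferences within each state but leaves the cross-state offset c(s) free, which is why a global reward is not identifiable without extra structure such as the centering term or prior proposed above.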



Get this paper in your agent:

hf papers read 2605.00623
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2605.00623 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2605.00623 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.00623 in a Space README.md to link it from this page.

Collections including this paper 1