What Matters for a Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion
Abstract
This work investigates which latent-manifold properties make a latent space amenable to diffusion models and proposes a Prior-Aligned AutoEncoder that explicitly optimizes latent-space structure for improved generative modeling.
Tokenizers are a crucial component of latent diffusion models, as they define the latent space in which diffusion models operate. However, existing tokenizers are primarily designed to improve reconstruction fidelity or to inherit pretrained representations, leaving unclear what kind of latent space is truly friendly for generative modeling. In this paper, we study this question from the perspective of latent manifold organization. By constructing controlled tokenizer variants, we identify three key properties of a diffusion-friendly latent manifold: coherent spatial structure, local manifold continuity, and global manifold semantics. We find that these properties are more consistent with downstream generation quality than reconstruction fidelity is. Motivated by this finding, we propose the Prior-Aligned AutoEncoder (PAE), which explicitly shapes the latent manifold instead of leaving a diffusion-friendly manifold to emerge indirectly from reconstruction or inheritance. Specifically, PAE leverages refined priors derived from vision foundation models (VFMs) and perturbation-based regularization to turn spatial structure, local continuity, and global semantics into explicit training objectives. On ImageNet 256×256, PAE improves both training efficiency and generation quality over existing tokenizers, reaching performance comparable to RAE with up to 13x faster convergence under the same training setup and achieving a new state-of-the-art gFID of 1.03. These results highlight the importance of organizing the latent manifold for latent diffusion models.
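The abstract describes PAE's objective as combining reconstruction with VFM-derived priors and perturbation-based regularization. The paper's actual losses are not given here, so the following is only a minimal toy sketch of that three-part objective; the linear encoder/decoder, the stand-in `vfm_feats`, the weights `lam_align`/`lam_smooth`, and all function names are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    # Toy linear "tokenizer" encoder: map inputs to latent tokens.
    return x @ W

def decode(z, W):
    # Toy linear decoder (tied transposed weights).
    return z @ W.T

def pae_style_loss(x, W, vfm_feats, sigma=0.1, lam_align=1.0, lam_smooth=1.0):
    """Hypothetical three-term objective in the spirit of the abstract:
    reconstruction + alignment to frozen VFM features (global semantics)
    + perturbation regularization (local manifold continuity)."""
    z = encode(x, W)
    # 1) Reconstruction fidelity.
    rec = np.mean((decode(z, W) - x) ** 2)
    # 2) Global semantics: pull latents toward (frozen) VFM-derived priors.
    align = np.mean((z - vfm_feats) ** 2)
    # 3) Local continuity: latents of slightly perturbed inputs stay close.
    z_pert = encode(x + sigma * rng.standard_normal(x.shape), W)
    smooth = np.mean((z_pert - z) ** 2)
    return rec + lam_align * align + lam_smooth * smooth

x = rng.standard_normal((4, 16))        # 4 toy "patches", dim 16
W = rng.standard_normal((16, 8)) * 0.1  # encoder weights -> 8-dim latents
vfm = rng.standard_normal((4, 8))       # stand-in for frozen VFM features
loss = pae_style_loss(x, W, vfm)
```

In a real tokenizer the three weights would trade off fidelity against manifold shaping; here they only show how the three properties become explicit training terms rather than emergent side effects.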
Community
PAE: State-of-the-Art Latent Diffusion with 13x Faster Training!
We are excited to present Prior-Aligned AutoEncoder (PAE), a new paradigm for constructing diffusion-friendly latent manifolds! By explicitly shaping the latent space geometry, PAE breaks the trade-off between reconstruction fidelity and generation learnability.
Highlights:
State-of-the-Art Quality: Achieves a new SOTA gFID of 1.03 on ImageNet 256×256, surpassing strong baselines like RAE and FAE.
Unprecedented Efficiency: Enables 13x faster convergence for downstream DiT training, reaching competitive performance in just 80 epochs versus 800+ epochs for previous methods.
Diffusion-Friendly Manifold: Explicitly optimizes three key geometric properties: Spatial Structure Coherence, Local Manifold Continuity, and Global Semantic Organization, ensuring smooth and semantically consistent latent spaces.
Robust Few-Step Sampling: Maintains high-quality generation (gFID 1.05) with only 45 denoising steps, thanks to improved local continuity.
Available Models: Pre-trained tokenizers based on DINOv2, SigLIP2, DINOv3, and MAE backbones! Code & Models are fully open-sourced on HuggingFace and ModelScope.
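One highlight above credits robust few-step sampling to improved local manifold continuity. The paper's actual evaluation protocol is not shown on this page, so here is only a hypothetical toy probe of that property: it measures how far an encoder's latents move per unit of input perturbation, where smaller scores suggest a locally smoother manifold. The function name, the noise scale `sigma`, and the linear encoder are all illustrative assumptions.

```python
import numpy as np

def local_continuity(encode, x, sigma=0.05, trials=8, seed=0):
    """Toy probe: average latent displacement per unit of input noise.
    Smaller values suggest a locally smoother (more continuous) manifold."""
    rng = np.random.default_rng(seed)
    z = encode(x)
    ratios = []
    for _ in range(trials):
        noise = sigma * rng.standard_normal(x.shape)
        # How much does the latent move relative to the input perturbation?
        ratios.append(np.linalg.norm(encode(x + noise) - z)
                      / np.linalg.norm(noise))
    return float(np.mean(ratios))

# Usage with a toy linear encoder that keeps 8 of 16 input coordinates.
W = np.eye(16)[:, :8]
smooth_score = local_continuity(lambda x: x @ W, np.ones((4, 16)))
```

A probe like this could compare tokenizer variants: an encoder whose latents jump under small input noise would score high and, by the highlight's reasoning, would be expected to need more denoising steps.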
The following papers were recommended by the Semantic Scholar API:
- Taming Sampling Perturbations with Variance Expansion Loss for Latent Diffusion Models (2026)
- RPiAE: A Representation-Pivoted Autoencoder Enhancing Both Image Generation and Editing (2026)
- Latent-Compressed Variational Autoencoder for Video Diffusion Models (2026)
- Coevolving Representations in Joint Image-Feature Diffusion (2026)
- TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders (2026)
- End-to-End Training for Unified Tokenization and Latent Denoising (2026)
- Repurposing Geometric Foundation Models for Multi-view Diffusion (2026)
