Byte Latent Transformer: Patches Scale Better Than Tokens Paper • 2412.09871 • Published Dec 13, 2024 • 108
Causal Diffusion Transformers for Generative Modeling Paper • 2412.12095 • Published Dec 16, 2024 • 23
TransMLA: Multi-Head Latent Attention Is All You Need Paper • 2502.07864 • Published Feb 11, 2025 • 69
Latent Diffusion Autoencoders: Toward Efficient and Meaningful Unsupervised Representation Learning in Medical Imaging Paper • 2504.08635 • Published Apr 11, 2025 • 4
D^2iT: Dynamic Diffusion Transformer for Accurate Image Generation Paper • 2504.09454 • Published Apr 13, 2025 • 11
Efficient Generative Model Training via Embedded Representation Warmup Paper • 2504.10188 • Published Apr 14, 2025 • 12
Softpick: No Attention Sink, No Massive Activations with Rectified Softmax Paper • 2504.20966 • Published Apr 29, 2025 • 31
LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer Paper • 2506.06952 • Published Jun 8, 2025 • 9
Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression Paper • 2506.09482 • Published Jun 11, 2025 • 45
From Bytes to Ideas: Language Modeling with Autoregressive U-Nets Paper • 2506.14761 • Published Jun 17, 2025 • 17
Energy-Based Transformers are Scalable Learners and Thinkers Paper • 2507.02092 • Published Jul 2, 2025 • 69
Dynamic Chunking for End-to-End Hierarchical Sequence Modeling Paper • 2507.07955 • Published Jul 10, 2025 • 27
Region-based Cluster Discrimination for Visual Representation Learning Paper • 2507.20025 • Published Jul 26, 2025 • 20
Local Scale Equivariance with Latent Deep Equilibrium Canonicalizer Paper • 2508.14187 • Published Aug 19, 2025 • 4
Artificial Hippocampus Networks for Efficient Long-Context Modeling Paper • 2510.07318 • Published Oct 8, 2025 • 32
Scaling Embeddings Outperforms Scaling Experts in Language Models Paper • 2601.21204 • Published Jan 29, 2026 • 102
Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders Paper • 2601.16208 • Published Jan 22, 2026 • 55
Nested Learning: The Illusion of Deep Learning Architectures Paper • 2512.24695 • Published Dec 31, 2025 • 45
V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising Paper • 2603.16792 • Published Mar 17, 2026 • 3
ViT-AdaLA: Adapting Vision Transformers with Linear Attention Paper • 2603.16063 • Published Mar 17, 2026 • 2