Abstract
SLA2 improves sparse-linear attention in diffusion models by introducing a learnable router, a direct sparse-linear attention formulation, and quantization-aware fine-tuning, improving efficiency while preserving generation quality.
Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, (i) SLA relies on a heuristic split that assigns computations to the sparse or linear branch based on attention-weight magnitude, which can be suboptimal. Additionally, (ii) after formally analyzing the attention error in SLA, we identify a mismatch between SLA and a direct decomposition into sparse and linear attention. We propose SLA2, which introduces (I) a learnable router that dynamically selects whether each attention computation should use sparse or linear attention, (II) a more faithful and direct sparse-linear attention formulation that uses a learnable ratio to combine the sparse and linear attention branches, and (III) a sparse + low-bit attention design, where low-bit attention is introduced via quantization-aware fine-tuning to reduce quantization error. Experiments show that on video diffusion models, SLA2 can achieve 97% attention sparsity and deliver an 18.6x attention speedup while preserving generation quality.
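To make components (I) and (II) concrete, below is a minimal PyTorch sketch of one way a learnable block-level router and a learnable mixing ratio could combine a sparse branch and a linear branch. This is not the authors' implementation: the class and parameter names (`SLA2Attention`, `router`, `alpha`, `block_size`, `keep_ratio`) are illustrative assumptions, the low-bit/QAT component (III) is omitted, the sparse branch is written densely for readability (a real kernel would compute only the routed blocks), and the gradient tricks needed to train through the hard routing mask (e.g., a straight-through estimator) are left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SLA2Attention(nn.Module):
    """Hypothetical sketch: sparse branch + linear branch mixed by a learnable ratio."""

    def __init__(self, dim, num_heads, block_size=64, keep_ratio=0.03):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.block_size = block_size
        self.keep_ratio = keep_ratio  # ~3% of key blocks kept -> ~97% sparsity
        # Learnable router: scores each (query-block, key-block) pair.
        self.router = nn.Linear(2 * self.head_dim, 1)
        # Learnable ratio mixing the sparse and linear branches.
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, q, k, v):
        # q, k, v: (batch, heads, seq, head_dim); seq assumed divisible by block_size.
        B, H, N, D = q.shape
        nb = N // self.block_size
        qb = q.view(B, H, nb, self.block_size, D).mean(dim=3)  # block summaries
        kb = k.view(B, H, nb, self.block_size, D).mean(dim=3)

        # Router logits for every block pair; keep the top-scoring fraction per query block.
        pair = torch.cat([
            qb.unsqueeze(3).expand(-1, -1, -1, nb, -1),
            kb.unsqueeze(2).expand(-1, -1, nb, -1, -1),
        ], dim=-1)
        logits = self.router(pair).squeeze(-1)               # (B, H, nb, nb)
        k_keep = max(1, int(self.keep_ratio * nb))
        thresh = logits.topk(k_keep, dim=-1).values[..., -1:]
        block_mask = logits >= thresh                        # hard routing decision

        # Sparse branch: softmax attention restricted to the routed blocks
        # (computed densely here purely for clarity).
        mask = block_mask.repeat_interleave(self.block_size, dim=2) \
                         .repeat_interleave(self.block_size, dim=3)
        scores = (q @ k.transpose(-2, -1)) / D ** 0.5
        scores = scores.masked_fill(~mask, float("-inf"))
        sparse_out = torch.softmax(scores, dim=-1) @ v

        # Linear branch: kernelized attention over all positions (O(N) form).
        q_lin, k_lin = F.elu(q) + 1, F.elu(k) + 1
        kv = k_lin.transpose(-2, -1) @ v                     # (B, H, D, D)
        z = q_lin @ k_lin.sum(dim=2, keepdim=True).transpose(-2, -1) + 1e-6
        linear_out = (q_lin @ kv) / z

        # Direct combination of the two branches via the learnable ratio.
        a = torch.sigmoid(self.alpha)
        return a * sparse_out + (1 - a) * linear_out
```

In this sketch the speedup would come from a fused kernel that materializes only the routed query-key blocks; the dense masked softmax above exists only to show the decomposition that the learnable ratio combines.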
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning (2026)
- SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer (2026)
- PISA: Piecewise Sparse Attention Is Wiser for Efficient Diffusion Transformers (2026)
- Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention (2026)
- Double-P: Hierarchical Top-P Sparse Attention for Long-Context LLMs (2026)
- VMonarch: Efficient Video Diffusion Transformers with Structured Attention (2026)
- Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers (2026)
