Papers
arxiv:2604.00886

PixelPrune: Pixel-Level Adaptive Visual Token Reduction via Predictive Coding

Published on Apr 1
· Submitted on Apr 2
AI-generated summary

PixelPrune reduces computational costs in Vision-Language Models by eliminating redundant image patches before Vision Transformer encoding through predictive-coding-based compression.

Abstract

Document understanding and GUI interaction are among the highest-value applications of Vision-Language Models (VLMs), yet they impose exceptionally heavy computational burdens: fine-grained text and small UI elements demand high-resolution inputs that produce tens of thousands of visual tokens. We observe that this cost is largely wasteful: across document and GUI benchmarks, only 22–71% of image patches are pixel-unique; the rest are exact duplicates of another patch in the same image. We propose PixelPrune, which exploits this pixel-level redundancy through predictive-coding-based compression, pruning redundant patches before the Vision Transformer (ViT) encoder. Because it operates in pixel space prior to any neural computation, PixelPrune accelerates both the ViT encoder and the downstream LLM, covering the full inference pipeline. The method is training-free, requires no learnable parameters, and supports pixel-lossless compression (τ = 0) as well as controlled lossy compression (τ > 0). Experiments across three model scales on document and GUI benchmarks show that PixelPrune maintains competitive task accuracy while delivering up to 4.2× inference speedup and 1.9× training acceleration. Code is available at https://github.com/OPPO-Mente-Lab/PixelPrune.
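To make the "pixel-unique patch" statistic concrete, here is a minimal sketch of exact duplicate-patch detection in pixel space (the τ = 0, pixel-lossless case). The function name, patch size, and hashing-by-bytes strategy are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def prune_duplicate_patches(image: np.ndarray, patch: int = 14):
    """Split `image` (H, W, C) into non-overlapping patch x patch tiles and
    keep only the first occurrence of each pixel-identical tile.
    Returns (indices of kept tiles, fraction of tiles that are pixel-unique)."""
    H, W, C = image.shape
    H, W = H - H % patch, W - W % patch  # drop any ragged border for simplicity
    tiles = (image[:H, :W]
             .reshape(H // patch, patch, W // patch, patch, C)
             .transpose(0, 2, 1, 3, 4)      # group into (rows, cols, patch, patch, C)
             .reshape(-1, patch * patch * C))
    seen, keep = set(), []
    for i, tile in enumerate(tiles):
        key = tile.tobytes()  # exact pixel match; a lossy tau > 0 scheme needs a fuzzier test
        if key not in seen:
            seen.add(key)
            keep.append(i)
    return np.array(keep), len(keep) / len(tiles)

# A synthetic "document" image: mostly white background, one distinct black patch.
img = np.full((56, 56, 3), 255, dtype=np.uint8)
img[:14, :14] = 0
kept, unique_ratio = prune_duplicate_patches(img)  # 2 unique tiles out of 16
```

On real document or GUI screenshots, large uniform regions (margins, backgrounds, repeated UI chrome) are what drive the unique-patch ratio down toward the 22% end of the reported range.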

Community

Paper submitter

PixelPrune compresses visual tokens before the ViT encoder via 2D predictive coding — no learnable parameters, training-free out-of-the-box, and compatible with FlashAttention. Fine-tuning is also supported for further gains. Code is available at https://github.com/OPPO-Mente-Lab/PixelPrune.
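The comment's "2D predictive coding" can be sketched as: predict each tile from a spatial neighbor and drop it when the residual falls within a tolerance τ. The choice of predictor below (the left neighbor) and the max-absolute-residual test are simplifying assumptions for illustration; the paper's actual predictor may differ:

```python
import numpy as np

def predictive_prune(image: np.ndarray, patch: int = 14, tau: int = 0):
    """Assumed predictive-coding-style pruning sketch: predict each tile from
    its left neighbor and drop it when the max absolute residual is <= tau.
    tau = 0 prunes only exact matches (lossless); tau > 0 is lossy.
    Returns (row, col) indices of kept tiles."""
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    tiles = (image[:rows * patch, :cols * patch]
             .reshape(rows, patch, cols, patch, C)
             .transpose(0, 2, 1, 3, 4))  # shape: (rows, cols, patch, patch, C)
    keep = []
    for r in range(rows):
        for c in range(cols):
            if c == 0:
                keep.append((r, c))  # first tile in a row has no left predictor
                continue
            # int16 avoids uint8 wraparound when subtracting pixel values
            residual = tiles[r, c].astype(np.int16) - tiles[r, c - 1].astype(np.int16)
            if np.abs(residual).max() > tau:
                keep.append((r, c))
    return keep

# Mostly-white page with one distinct black patch in the top-left corner:
page = np.full((56, 56, 3), 255, dtype=np.uint8)
page[:14, :14] = 0
kept = predictive_prune(page)  # keeps each row's first tile plus the edge after the black patch
```

Because this runs on raw pixels before any neural computation, the pruned tiles never enter the ViT, which is why the savings propagate to the downstream LLM as well.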


Get this paper in your agent:

hf papers read 2604.00886
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 0 · Datasets citing this paper: 0 · Spaces citing this paper: 0 · Collections including this paper: 1

Cite arxiv.org/abs/2604.00886 in a model, dataset, or Space README.md to link it from this page.