arxiv:2606.14383

IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products

Published on Jun 12

Submitted by Liang Ding (Hiring https://liamding.cc/hiring.htm) on Jun 18

1688 multimodal & industrial AI

Abstract

IndustryBench-MIPU is introduced as the first large-scale benchmark for multi-image industrial product understanding, focusing on structured attribute extraction from heterogeneous product images to evaluate multimodal models' ability to recover dense technical specifications.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Industrial products such as valves and circuit breakers are defined by dense technical specifications that govern procurement, compatibility, and safety across supply chains. These specifications are scattered across multiple heterogeneous product images, including specification tables, nameplates, and technical drawings, yet whether Multimodal Large Language Models (MLLMs) can reliably recover them remains underexplored. To fill this gap, we introduce IndustryBench-MIPU, the first large-scale benchmark for multi-image industrial product understanding, built around structured attribute extraction, that is, recovering property-value pairs from product images. This task jointly probes text recognition on specification tables and nameplates, visual reasoning over technical drawings, domain knowledge to decode industrial terminology, and cross-image evidence integration to assemble scattered specifications. Concretely, the benchmark comprises 4,559 products across 27,652 images with 103,703 annotations spanning 18 industrial categories, constructed through multi-model consensus and three-tier quality assurance. Evaluating nine MLLMs under both single-image and product-level multi-image settings reveals a stark completeness gap: models achieve high precision (86-94%) but the best recovers only 49.9% of product-level attributes; moving from single-image to multi-image extraction costs 15-34 percentage points of recall. Multi-image completeness, not single-image accuracy, is the core bottleneck. Dataset and code are publicly available.
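To make the precision/recall distinction in the abstract concrete, here is a minimal sketch of product-level scoring over property-value pairs. It assumes exact matching after simple whitespace and case normalization; the benchmark's actual matching rules are not described on this page and may differ. The function names and the valve attributes below are hypothetical, chosen only for illustration.

```python
from typing import Dict, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization, not the paper's)."""
    return " ".join(text.strip().lower().split())


def precision_recall(predicted: Dict[str, str], gold: Dict[str, str]) -> Tuple[float, float]:
    """Score predicted property-value pairs against gold pairs by exact match."""
    pred_pairs = {(normalize(k), normalize(v)) for k, v in predicted.items()}
    gold_pairs = {(normalize(k), normalize(v)) for k, v in gold.items()}
    correct = len(pred_pairs & gold_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall


# Hypothetical gold annotations for one valve (illustrative values only).
gold = {
    "Nominal Diameter": "DN50",
    "Pressure Rating": "PN16",
    "Body Material": "Stainless Steel 304",
    "Connection": "Flange",
}
# A model that reports only half of the attributes, all of them correct.
predicted = {"nominal diameter": "DN50", "pressure rating": "PN16"}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every extracted pair is correct, yet half of the product's attributes are missing. That is the pattern the abstract calls a completeness gap: high precision alongside low recall.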


Community

alphadl

Paper submitter about 7 hours ago

Hi everyone! We are excited to share our team's latest work from Alibaba: IndustryBench-MIPU.

While MLLMs are increasingly deployed for general visual tasks, understanding complex industrial products requires assembling dense technical specifications scattered across multiple heterogeneous images, including specification tables, nameplates, and technical drawings. To bridge this gap, we built the first large-scale benchmark for multi-image industrial product understanding.

Key Highlights of our Benchmark:

Massive Scale: 4,559 products across 27,652 images with 103,703 annotations spanning 18 industrial categories.

Complex Challenges: The task requires models to jointly perform text recognition, visual reasoning over technical drawings, domain knowledge interpretation, and cross-image evidence integration.

Core Findings: We evaluated 9 MLLMs and uncovered a stark completeness gap. While current models achieve high precision (86-94%), the best model recovers only 49.9% of product-level attributes.
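As a rough illustration of what cross-image evidence integration means in the highlights above, the sketch below merges attributes extracted from individual images into one product-level record by majority vote. This is not the benchmark's pipeline or any evaluated model's method; the function name, the voting rule, and the circuit-breaker values are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def merge_image_predictions(per_image: List[Dict[str, str]]) -> Dict[str, str]:
    """Combine per-image attribute dicts into one record (hypothetical voting rule)."""
    votes: Dict[str, Counter] = defaultdict(Counter)
    for image_attrs in per_image:
        for prop, value in image_attrs.items():
            # Property names are lowercased so the same attribute read
            # from different images lands in the same bucket.
            votes[prop.strip().lower()][value.strip()] += 1
    # Keep the most frequently seen value for each property.
    return {prop: counter.most_common(1)[0][0] for prop, counter in votes.items()}


# Hypothetical extractions from three images of one circuit breaker.
per_image = [
    {"Rated Current": "63A", "Poles": "3P"},           # nameplate
    {"Rated Current": "63A", "Rated Voltage": "400V"},  # specification table
    {"Mounting": "DIN rail"},                           # technical drawing
]

merged = merge_image_predictions(per_image)
print(merged)
```

No single image in this example carries more than two of the four attributes, so a model that reads each image well can still return an incomplete product record unless evidence is combined across images.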

Our evaluation indicates that multi-image completeness, rather than single-image accuracy, is the core bottleneck for real-world industrial AI. As we continue to push the boundaries of multimodal and industrial intelligence, we hope this dataset and benchmark serve as a valuable testbed for the community.

We would love to hear your thoughts, feedback, and see how your models perform!

Paper: arxiv.org/abs/2606.14383

Dataset: huggingface.co/datasets/alibaba-multimodal-industrial-ai/IndustryBench-MIPU

Code: github.com/alibaba-multimodal-industrial-ai/IndustryBench-MIPU

