PhotoBench: Beyond Visual Matching Towards Personalized Intent-Driven Photo Retrieval
Abstract
PhotoBench is the first photo retrieval benchmark built from authentic personal albums. It shifts the focus from visual matching to personalized, multi-source, intent-driven reasoning, revealing limitations in current unified embedding models and underscoring the need for advanced agentic reasoning systems.
Personal photo albums are not merely collections of static images but living, ecological archives shaped by temporal continuity, social entanglement, and rich metadata, which makes personalized photo retrieval non-trivial. Existing retrieval benchmarks, however, rely heavily on context-isolated web snapshots and fail to capture the multi-source reasoning required to resolve authentic, intent-driven user queries. To bridge this gap, we introduce PhotoBench, the first benchmark constructed from authentic personal albums, designed to shift the paradigm from visual matching to personalized, multi-source, intent-driven reasoning. Building on a rigorous multi-source profiling framework that integrates visual semantics, spatio-temporal metadata, social identity, and temporal events for each image, we synthesize complex intent-driven queries rooted in users' life trajectories. Extensive evaluation on PhotoBench exposes two critical limitations: the modality gap, where unified embedding models collapse on non-visual constraints, and the source fusion paradox, where agentic systems orchestrate their tools poorly across sources. These findings indicate that the next frontier in personal multimodal retrieval lies beyond unified embeddings and requires robust agentic reasoning systems capable of precise constraint satisfaction and multi-source fusion. PhotoBench is publicly available.
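The profiling framework described above attaches four sources of evidence to every image. As a rough illustration (not the paper's actual schema; all names and fields below are hypothetical), the sketch shows how a multi-source profile and an intent-driven query over it might be represented, and why such queries impose hard non-visual constraints rather than a single similarity score:

```python
# Hypothetical sketch of a multi-source photo profile and an
# intent-driven query; field names are illustrative assumptions,
# not PhotoBench's real schema.
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class PhotoProfile:
    visual_tags: set[str]      # visual semantics, e.g. {"beach", "sunset"}
    timestamp: datetime        # spatio-temporal metadata
    location: str
    people: set[str] = field(default_factory=set)  # social identity
    events: set[str] = field(default_factory=set)  # temporal/life events


@dataclass
class IntentQuery:
    """Constraints that span multiple profile sources."""
    visual: set[str] = field(default_factory=set)
    people: set[str] = field(default_factory=set)
    events: set[str] = field(default_factory=set)
    after: datetime | None = None
    before: datetime | None = None


def satisfies(photo: PhotoProfile, q: IntentQuery) -> bool:
    """True only if the photo meets every constraint in every source."""
    return (
        q.visual <= photo.visual_tags
        and q.people <= photo.people
        and q.events <= photo.events
        and (q.after is None or photo.timestamp >= q.after)
        and (q.before is None or photo.timestamp <= q.before)
    )


# e.g. "beach photos with Mom from our 2023 summer trip"
photo = PhotoProfile(
    visual_tags={"beach", "sunset"},
    timestamp=datetime(2023, 7, 14),
    location="Okinawa",
    people={"Mom"},
    events={"summer trip"},
)
query = IntentQuery(
    visual={"beach"},
    people={"Mom"},
    events={"summer trip"},
    after=datetime(2023, 6, 1),
    before=datetime(2023, 9, 1),
)
print(satisfies(photo, query))  # True
```

Under this toy model, a retriever that scores only visual similarity could rank a stranger's beach photo from the wrong year above the correct one, which is one way to read the modality gap the paper reports.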
Community
Paper link: https://arxiv.org/abs/2603.01493
Github Repo: https://github.com/LaVieEnRose365/PhotoBench
Leaderboard: https://www.sorrowcloud.tech/leaderboard
We will continue updating the benchmark and leaderboard.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories (2026)
- Reasoning-Augmented Representations for Multimodal Retrieval (2026)
- Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation (2026)
- KG-ViP: Bridging Knowledge Grounding and Visual Perception in Multi-modal LLMs for Visual Question Answering (2026)
- XR: Cross-Modal Agents for Composed Image Retrieval (2026)
- Fix Before Search: Benchmarking Agentic Query Visual Pre-processing in Multimodal Retrieval-augmented Generation (2026)
- MLDocRAG: Multimodal Long-Context Document Retrieval Augmented Generation (2026)