Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction
Abstract
Direct corpus interaction enables more effective agentic search by allowing agents to query raw text directly, outperforming traditional retrieval methods in complex tasks.
Modern retrieval systems, whether lexical or semantic, expose a corpus through a fixed similarity interface that compresses access into a single top-k retrieval step before reasoning. This abstraction is efficient, but for agentic search it becomes a bottleneck: exact lexical constraints, sparse clue conjunctions, local context checks, and multi-step hypothesis refinement are difficult to implement by calling a conventional off-the-shelf retriever, and evidence filtered out early cannot be recovered by stronger downstream reasoning. Agentic tasks further exacerbate this limitation because they require agents to orchestrate multiple steps, including discovering intermediate entities, combining weak clues, and revising the plan after observing partial evidence. To tackle this limitation, we study direct corpus interaction (DCI), in which an agent searches the raw corpus directly with general-purpose terminal tools (e.g., grep, file reads, shell commands, lightweight scripts), without any embedding model, vector index, or retrieval API. This approach requires no offline indexing and adapts naturally to evolving local corpora. Across IR benchmarks and end-to-end agentic search tasks, this simple setup substantially outperforms strong sparse, dense, and reranking baselines on several BRIGHT and BEIR datasets, and attains strong accuracy on BrowseComp-Plus and multi-hop QA without relying on any conventional semantic retriever. Our results indicate that as language agents become stronger, retrieval quality depends not only on reasoning ability but also on the resolution of the interface through which the model interacts with the corpus, and that DCI opens a broader interface-design space for agentic search.
Community
The best retriever for agentic search... is no retriever. Introducing Direct Corpus Interaction (DCI).
We replaced the entire agentic search pipeline (embedding model, vector index, top-k retrieval) with only grep and bash.
The Magic:
The agent searches the raw corpus directly with grep, find, bash, and shell pipelines, exactly like a coding agent navigating a codebase. No preprocessing. No embedding model. No vector index. No offline indexing.
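To make the idea concrete, here is a minimal sketch of the kind of shell interaction described above. The corpus files and queries are invented for illustration; the paper's actual agent loop, prompts, and tool wrappers are not shown.

```shell
# Illustrative direct corpus interaction: the agent issues ordinary shell
# commands against raw text files instead of calling a retriever.

# Toy corpus (hypothetical documents, not from the paper's benchmarks).
mkdir -p corpus
printf 'Marie Curie won the Nobel Prize in Physics in 1903.\n' > corpus/doc1.txt
printf 'The Nobel Prize in Chemistry was awarded to Curie in 1911.\n' > corpus/doc2.txt

# Exact lexical constraint: list every document mentioning "Curie".
grep -rl 'Curie' corpus

# Sparse clue conjunction: documents matching both "Curie" and "1911".
grep -rl 'Curie' corpus | xargs grep -l '1911'

# Local context check: show the matching line with file name and line number.
grep -rn '1911' corpus
```

Because each step is an ordinary command over raw text, the agent can revise its query after seeing partial evidence, with no offline index to rebuild.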
The Results:
DCI outperforms top baselines across 13 benchmarks, with average gains of:
- Agentic Search (BrowseComp-Plus): +11.0%
- Multi-hop QA: +30.7%
- IR Ranking: +21.5%
Insights:
Beyond accuracy, we conduct a series of controlled ablation studies to pinpoint the sources of DCI's gains in Section 4. Specifically, we examine trajectory-level search patterns (RQ2), evidence utilization (RQ3), corpus scale (RQ4), context management (RQ5), and tool usage (RQ6).
Try it yourself!
- GitHub: https://github.com/DCI-Agent/DCI-Agent-Lite
- Demo: https://huggingface.co/spaces/DCI-Agent/demo
- Eval logs: https://huggingface.co/datasets/DCI-Agent/eval-logs
Interesting breakdown of this paper on arXivLens: https://arxivlens.com/PaperView/Details/beyond-semantic-similarity-rethinking-retrieval-for-agentic-search-via-direct-corpus-interaction-2755-8403a410
Covers the executive summary, detailed methodology, and practical applications.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper, recommended by the Semantic Scholar API:
- A Reference Architecture for Agentic Hybrid Retrieval in Dataset Search (2026)
- Coding Agents are Effective Long-Context Processors (2026)
- Reproducing Complex Set-Compositional Information Retrieval (2026)
- FitText: Evolving Agent Tool Ecologies via Memetic Retrieval (2026)
- TaSR-RAG: Taxonomy-guided Structured Reasoning for Retrieval-Augmented Generation (2026)
- CoSearch: Joint Training of Reasoning and Document Ranking via Reinforcement Learning for Agentic Search (2026)
- Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs (2026)