DataClawBench: An Agent Benchmark for Exploratory Real-World Financial Data Analysis
Abstract
Exploratory financial data analysis benchmark challenges LLM agents' reliability in real-world scenarios with limited prior guidance and native data noise.
Autonomous data analysis agents are increasingly expected to conduct exploratory analysis over underexplored data environments. This burden is especially salient in complex financial analytics, where relevant evidence is rarely pre-specified. However, existing benchmarks typically evaluate such agents in prior-guided settings, providing selected data sources, explicit data schemas, or cleaned data, thereby understating the exploratory burden. We introduce DataClawBench, a benchmark for exploratory real-world financial data analysis under limited prior guidance. DataClawBench contains approximately 2.06 million real-world records across enterprise, industry, and policy domains, with native data noise preserved. It further includes 492 cross-domain tasks derived from think-tank consulting scenarios, each annotated with intermediate milestones that diagnose exploration and reasoning failures beyond outcome accuracy. A systematic evaluation of eight advanced LLMs under the OpenClaw agent reveals that exploratory data analysis breaks agent reliability: more exploration does not reliably translate into task-relevant progress or correct final answers.
Get this paper in your agent:
hf papers read 2605.02503 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 1
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper