Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models
Abstract
Counterfactual charts are introduced to rigorously evaluate visual reasoning in chart question-answering by varying underlying data while keeping tasks fixed, revealing hidden model failures and generalization limitations.
Chart question-answering (QA) benchmarks aim to pose questions that require visual reasoning to answer correctly, but models can often reach solutions through shortcuts or through prior familiarity with a chart drawn from their own background knowledge. To strictly evaluate visual reasoning, we propose counterfactual charts, where the chart-question task remains fixed but the underlying chart and the corresponding answer are varied. We introduce Chartographer, a framework to reverse engineer charts into executable code, validate reconstruction fidelity, generate seed-controlled counterfactual variants, and derive new answers from executable QA logic. We apply this framework to existing chart QA datasets and evaluate proprietary and open-source vision-language models (VLMs), measuring variation sensitivity and generalizability. Counterfactual charts reveal failures hidden by single-chart performance: VLMs often fail to generalize after answering the original chart correctly. We find that failures are most prevalent when updated charts require novel visual reasoning pathways.
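To make the idea of seed-controlled counterfactual variants with executable QA logic concrete, here is a minimal sketch. It is not the Chartographer implementation: `BarChart`, `answer_largest_category`, `make_counterfactual`, `generate_variants`, and `variant_accuracy` are hypothetical names, the example data is invented for illustration, and the perturbation (shuffle plus rescale) is only one assumed way to vary the underlying data. Rendering and reconstruction-fidelity checks are omitted.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class BarChart:
    """Hypothetical stand-in for a chart recovered as executable code."""
    title: str
    categories: Tuple[str, ...]
    values: Tuple[float, ...]


def answer_largest_category(chart: BarChart) -> str:
    """Executable QA logic for: 'Which category has the largest value?'"""
    best_index = max(range(len(chart.values)), key=lambda i: chart.values[i])
    return chart.categories[best_index]


def make_counterfactual(chart: BarChart, seed: int) -> BarChart:
    """Return a variant with the same categories but seed-controlled new values."""
    rng = random.Random(seed)  # local RNG, so a given seed always reproduces its variant
    values = list(chart.values)
    rng.shuffle(values)  # move the extremes to different bars
    # Assumed perturbation: rescale each bar by a factor in [0.8, 1.2].
    values = [round(v * rng.uniform(0.8, 1.2), 1) for v in values]
    return BarChart(chart.title, chart.categories, tuple(values))


def generate_variants(chart: BarChart, seeds: List[int]) -> List[Tuple[BarChart, str]]:
    """Pair every counterfactual chart with an answer recomputed from its data."""
    variants = [make_counterfactual(chart, seed) for seed in seeds]
    return [(variant, answer_largest_category(variant)) for variant in variants]


def variant_accuracy(predictions: List[str], answers: List[str]) -> float:
    """Fraction of counterfactual variants answered correctly (illustrative metric)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


# Invented example data; the question stays fixed while the data and answer change.
original = BarChart(
    title="Units sold by region",
    categories=("North", "South", "East", "West"),
    values=(12.0, 30.0, 18.0, 24.0),
)
variants = generate_variants(original, seeds=[0, 1, 2])
```

Because the answer is recomputed from each variant's data rather than copied from the original, a model that has memorized the original chart gains nothing on the variants. Comparing correctness on the original with `variant_accuracy` over the variants gives one rough view of generalization; the paper's own metrics may be defined differently.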