arxiv:2605.27311

Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

Published on May 26

Submitted by GUIJIN SON on May 28


Authors:

Jesse C. Cresswell ,

Abstract

Counterfactual charts are introduced to rigorously evaluate visual reasoning in chart question-answering by varying underlying data while keeping tasks fixed, revealing hidden model failures and generalization limitations.

AI-generated summary

Chart question-answering (QA) benchmarks aim to pose questions that require visual reasoning to answer correctly, but models can often reach solutions through shortcuts or through prior familiarity with a chart drawn from their own background knowledge. To strictly evaluate visual reasoning, we propose counterfactual charts, where the chart-question task remains fixed but the underlying chart and the corresponding answer are varied. We introduce Chartographer, a framework to reverse engineer charts into executable code, validate reconstruction fidelity, generate seed-controlled counterfactual variants, and derive new answers from executable QA logic. We apply this framework to existing chart QA datasets and evaluate proprietary and open-source vision-language models (VLMs), measuring variation sensitivity and generalizability. Counterfactual charts reveal failures hidden by single-chart performance: VLMs often fail to generalize after answering the original chart correctly. We find that failures are most prevalent when updated charts require novel visual reasoning pathways.
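The idea of seed-controlled counterfactual variants with answers derived from executable QA logic can be illustrated with a small sketch. This is not Chartographer's code: the `BarChart` type, `make_variant`, `answer_tallest_bar`, and `generalization_rate` are hypothetical names invented for illustration, the chart is reduced to its underlying data (no rendering), and the resampling rule and the accuracy-style metric are assumptions rather than the paper's definitions.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BarChart:
    """Hypothetical stand-in for a chart recovered as executable code."""
    title: str
    categories: tuple[str, ...]
    values: tuple[float, ...]


def make_variant(base: BarChart, seed: int) -> BarChart:
    """Resample the underlying data deterministically from a seed.

    Assumption: new values are drawn uniformly within the original range.
    """
    rng = random.Random(seed)
    lo, hi = min(base.values), max(base.values)
    new_values = tuple(round(rng.uniform(lo, hi), 1) for _ in base.values)
    return BarChart(base.title, base.categories, new_values)


def answer_tallest_bar(chart: BarChart) -> str:
    """Executable QA logic for the fixed question 'Which category is highest?'."""
    best = max(range(len(chart.values)), key=lambda i: chart.values[i])
    return chart.categories[best]


def generalization_rate(model_answers: dict[int, str], truths: dict[int, str]) -> float:
    """Fraction of counterfactual variants answered correctly (assumed metric)."""
    correct = sum(model_answers[seed] == truths[seed] for seed in truths)
    return correct / len(truths)


# The question stays fixed; only the data and the derived answer change.
base = BarChart("Units sold", ("A", "B", "C", "D"), (12.0, 30.0, 18.0, 7.0))
variants = {seed: make_variant(base, seed) for seed in range(5)}
truths = {seed: answer_tallest_bar(chart) for seed, chart in variants.items()}

# A "model" that memorized the original chart repeats the original answer.
memorized = {seed: answer_tallest_bar(base) for seed in variants}
print(generalization_rate(memorized, truths))
```

Because the ground truth is recomputed from each variant's data, a model that answers from memory of the original chart is scored only on the variants where the original answer happens to remain correct.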