Prediction Bottlenecks Don't Discover Causal Structure (But Here's What They Actually Do)
Abstract
The claim that a Mamba state-space model recovers Granger-causal structure through a simple readout was tested across synthetic and real datasets with interventions; the method-level claim does not hold once confounding factors and baseline approaches are accounted for.
A Mamba state-space model trained only for next-step prediction appears to recover Granger-causal structure through a simple readout S = |W_{out} W_{in}|, with early experiments suggesting the phenomenon generalized across architectures and benefited from interventional data at p < 10^{-5}. We package the protocol used to test that claim -- standardized synthetic generators (VAR/Lorenz/CauseMe-style), three intervention semantics (do(X=c), soft-noise, random-forcing), edge-provenance cards on three real datasets, and size-matched control arms -- as a reusable falsification benchmark, and walk the claim through it in five stages. The method-level claim does not survive: (i) a plain linear bottleneck does as well or better; (ii) tuned Lasso beats the bottleneck on synthetic CauseMe-style benchmarks, and on Lorenz-96 (the only real benchmark with unambiguous ground truth) classical PCMCI and Granger lead a tight cluster in which the bottleneck trails; (iii) the headline intervention advantage is roughly 60% a sample-size confound, and the residual disappears under standard do(X=c) interventions, surviving only under a non-standard random-forcing scheme; (iv) even that residual reproduces, with a larger effect, in classical bivariate Granger -- the effect is method-agnostic. What survives is a narrow characterization result; the benchmark is the lasting artifact, and each stage above is one of its control arms.
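The readout under test can be made concrete with a small sketch: fit a rank-constrained linear next-step predictor on a toy VAR(1) system and score candidate edges via S = |W_out W_in|. The readout formula is the one named in the abstract; the reduced-rank-regression fit, the toy system, and its coefficients are illustrative assumptions, not the paper's actual training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sparse VAR(1) ground truth: x_{t+1} = A x_t + noise.
d, T, r = 6, 5000, 4
A = np.zeros((d, d))
A[np.diag_indices(d)] = 0.5
A[1, 0] = 0.4; A[3, 2] = 0.4; A[5, 4] = -0.4   # true causal edges
X = np.zeros((T, d))
for t in range(T - 1):
    X[t + 1] = X[t] @ A.T + 0.1 * rng.standard_normal(d)

# Linear bottleneck predictor x_{t+1} ~ W_out (W_in x_t) with rank r,
# fit here by reduced-rank regression: full least squares, then
# rank-r truncation via SVD (an assumed stand-in for gradient training).
Xt, Xn = X[:-1], X[1:]
W_full = np.linalg.lstsq(Xt, Xn, rcond=None)[0].T   # x_{t+1} ~ W_full x_t
U, s, Vt = np.linalg.svd(W_full)
W_out = U[:, :r] * s[:r]    # d x r
W_in = Vt[:r]               # r x d

# The readout from the claim: S_{ij} = |W_out W_in|_{ij} scores edge j -> i.
S = np.abs(W_out @ W_in)
true_edges = np.abs(A) > 0
# Sanity check: true edges should score higher than absent ones on average.
print(S[true_edges].mean() > S[~true_edges].mean())
```

On a near-linear system like this the readout looks causal almost for free, which is exactly the point of the paper's linear-bottleneck control arm: matching the score is low-rank regression, not causal discovery.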
Community
This paper falsifies the claim that next-step prediction bottlenecks (especially Mamba/SSM weight projections) recover causal structure, showing instead that their apparent gains are mostly low-rank regression, sample-size confounds, intervention-semantics artifacts, and target-corruption robustness, with the main durable contribution being a reusable falsification benchmark.
➡️ Key Highlights of their Prediction-as-Causal-Discovery Falsification Framework: ✨
🧪 Reusable Five-Stage Falsification Benchmark: Introduces a control-heavy benchmark spanning VAR, Lorenz-96, CauseMe-style generators, real datasets with edge-provenance cards, matched-capacity architectures, size-matched observational controls, and multiple intervention semantics to stress-test claims that prediction models implicitly recover causal graphs.
🧩 Weight-Projection Causality Does Not Survive Controls: Tests the extraction rule (S = |W_{out}W_{in}|) for bottleneck predictors and shows that linear bottlenecks match or beat Mamba SSMs, tuned Lasso dominates on synthetic graph recovery, and classical PCMCI/Granger-style methods outperform the bottleneck on clean Lorenz-96 ground truth.
🧠 Intervention Gains Are Confounds, Not Causal Evidence: Demonstrates that the reported interventional advantage mostly comes from extra sample size and a non-standard per-step random-forcing intervention; under proper do(X_i=c) interventions the effect nearly vanishes, while the residual appears even more strongly in classical bivariate Granger, indicating method-agnostic target-corruption robustness rather than learned causal discovery.
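The gap between the three intervention semantics is easy to see on a toy simulator. The sketch below (the VAR system, coefficients, and `simulate` helper are illustrative assumptions, not the benchmark's actual generators) implements hard do(X_i=c), soft-noise, and per-step random forcing:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy VAR(1) simulator with the three intervention semantics named in the
# paper: do(X_i = c), soft-noise, and per-step random forcing.
def simulate(A, T, intervene=None, i=0, c=1.0, sigma=0.1):
    d = A.shape[0]
    x = np.zeros(d)
    out = np.zeros((T, d))
    for t in range(T):
        x = A @ x + sigma * rng.standard_normal(d)
        if intervene == "do":       # hard: clamp node i to a constant
            x[i] = c
        elif intervene == "soft":   # soft: extra noise on node i's mechanism
            x[i] += 0.5 * rng.standard_normal()
        elif intervene == "force":  # overwrite node i with fresh noise each step
            x[i] = rng.standard_normal()
        out[t] = x
    return out

A = np.array([[0.5, 0.0], [0.4, 0.5]])   # single true edge x0 -> x1
obs = simulate(A, 2000)
hard = simulate(A, 2000, intervene="do", i=0)
forced = simulate(A, 2000, intervene="force", i=0)

# Under do(X_0 = c) the intervened node carries no variance, so it can tell a
# downstream-prediction method nothing; under random forcing it keeps driving
# x1 with full variance. This is the semantics gap behind the residual
# "interventional advantage" the paper dissects.
print(hard[:, 0].std(), forced[:, 0].std())
```

The design point: any method that benefits from the forcing regime but not from do(X_i=c) is exploiting injected target variance, not intervention-specific causal information, which is why the benchmark treats the two semantics as separate control arms.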
Get this paper in your agent:
hf papers read 2605.09169