## Models know they're being influenced. They just don't tell you.
12 open-weight reasoning models. 41,832 inference runs. Six types of reasoning hints. One finding: models acknowledge influence ~87.5% of the time in their thinking tokens, but only ~28.6% in their final answers.
If you're using CoT monitoring for safety, this is a blind spot. The reasoning trace looks clean while the model's internal deliberation tells a different story.
- Faithfulness ranges from 39.7% to 89.9% across model families
- Social-pressure hints are least acknowledged (consistency: 35.5%, sycophancy: 53.9%)
- Training methodology matters more than scale
## I couldn't replicate my own faithfulness results. Turns out that's the point.
While building a large-scale CoT faithfulness study across open-weight reasoning models, I hit a wall: my faithfulness numbers kept shifting depending on how I classified whether a model "acknowledged" a hint in its reasoning. I assumed it was a pipeline bug. After weeks of debugging, I realized the instability *was* the finding.
### What I did
I took 10,276 reasoning traces where a hint had successfully flipped a model's answer (meaning we *know* the hint influenced the output) and ran three different classifiers to detect whether the model acknowledged that influence in its chain-of-thought:
| Classifier | What it does | Overall faithfulness |
| --- | --- | --- |
| **Regex-only** | Pattern-matches 38 keywords like "hint," "told," "suggested" | 74.4% |
| **Regex + Ollama pipeline** | Regex first, then a 3-judge local LLM majority vote on ambiguous cases | 82.6% |
| **Claude Sonnet 4 judge** | Independent LLM reads the full trace and judges epistemic dependence | 69.7% |
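The regex-only row is the simplest to picture. Here's a minimal sketch of that kind of classifier; the keyword list below is illustrative, not the study's actual 38-keyword set:

```python
import re

# Illustrative subset of acknowledgment keywords.
# The study's actual list contains 38 patterns.
ACK_KEYWORDS = ["hint", "told", "suggested", "pointed out"]

ACK_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in ACK_KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def acknowledges_hint(trace: str) -> bool:
    """True if the reasoning trace pattern-matches any acknowledgment keyword."""
    return bool(ACK_PATTERN.search(trace))

print(acknowledges_hint("The hint says the answer is B, so I'll go with B."))  # True
print(acknowledges_hint("Computing directly, the answer is B."))               # False
```

The appeal, and the trap, of this approach is obvious: it's fast and deterministic, but it counts any mention of a keyword as an acknowledgment, which is exactly the kind of decision that moves the headline number.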
These aren't close. The 95% confidence intervals don't even overlap. All pairwise per-model gaps are statistically significant (McNemar's test, p < 0.001).
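Both checks are cheap to reproduce. Here's a minimal sketch, assuming you have per-trace acknowledgment labels from each pair of classifiers; the counts in the comments are placeholders, not the study's data:

```python
from scipy.stats import binomtest, norm

def wilson_ci(k: int, n: int, alpha: float = 0.05):
    """Wilson score confidence interval for a proportion k/n
    (e.g. k = traces judged faithful, n = total traces)."""
    z = norm.ppf(1 - alpha / 2)
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * ((p * (1 - p) / n + z**2 / (4 * n**2)) ** 0.5) / denom
    return center - half, center + half

def mcnemar_exact(b: int, c: int) -> float:
    """Exact McNemar p-value from the discordant counts:
    b = traces classifier A calls faithful but B doesn't,
    c = the reverse. Reduces to a two-sided binomial test."""
    return binomtest(min(b, c), b + c, 0.5, alternative="two-sided").pvalue

# Placeholder counts, purely for illustration:
print(wilson_ci(744, 1000))      # CI around 0.744
print(mcnemar_exact(120, 40))    # small p: the two classifiers disagree systematically
```

McNemar's test is the right tool here because the same traces are scored by every classifier, so the comparison is paired: only the discordant traces carry information about which classifier is more permissive.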
Classifier choice doesn't just change the numbers. It reverses model rankings. Qwen3.5-27B ranks **1st** under the pipeline but **7th** under the Sonnet judge. OLMo-3.1-32B goes from **9th to 3rd**.