## I couldn't replicate my own faithfulness results. Turns out that's the point.
While building a large-scale CoT faithfulness study across open-weight reasoning models, I hit a wall: my faithfulness numbers kept shifting depending on how I classified whether a model "acknowledged" a hint in its reasoning. I assumed it was a pipeline bug. After weeks of debugging, I realized the instability *was* the finding.
### What I did
I took 10,276 reasoning traces where hints had successfully flipped a model's answer, meaning we *know* the hint influenced the output, and ran three different classifiers to detect whether the model acknowledged that influence in its chain-of-thought:
| Classifier | What it does | Overall faithfulness |
| --------------------------- | ------------------------------------------------------------ | -------------------- |
| **Regex-only** | Pattern-matches 38 keywords like "hint," "told," "suggested" | 74.4% |
| **Regex + Ollama pipeline** | Regex first, then a 3-judge local LLM majority vote on ambiguous cases | 82.6% |
| **Claude Sonnet 4 judge** | Independent LLM reads the full trace and judges epistemic dependence | 69.7% |
These aren't close. The 95% confidence intervals don't even overlap. All pairwise per-model gaps are statistically significant (McNemar's test, p < 0.001).
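For readers unfamiliar with McNemar's test: on paired verdicts from two classifiers, only the discordant cells matter (traces one classifier calls faithful and the other doesn't). A minimal exact-binomial sketch in Python, with illustrative counts rather than the study's actual data:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from the discordant counts:
    b = traces classifier A marked faithful but B did not,
    c = the reverse. Concordant cells cancel out of the test."""
    n = b + c
    k = min(b, c)
    # Two-sided exact binomial tail under the null p = 0.5
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(p, 1.0)

# Illustrative discordant counts (NOT from the study):
print(mcnemar_exact(120, 60))
```

With thousands of traces per model, even a modest asymmetry between the two discordant cells drives the p-value far below 0.001.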
### Which models
12 open-weight reasoning models spanning 9 families (7B to 1T parameters): DeepSeek-R1, DeepSeek-V3.2-Speciale, Qwen3-235B, Qwen3.5-27B, QwQ-32B, Gemma-3-27B, Phi-4-reasoning-plus, OLMo-3.1-32B, Llama-4-Maverick, Seed-1.6-Flash, GLM-4-32B, and Falcon-H1-34B.
### The rankings flip
Classifier choice doesn't just change the numbers. It reverses model rankings. Qwen3.5-27B ranks **1st** under the pipeline but **7th** under the Sonnet judge. OLMo-3.1-32B goes from **9th to 3rd**.
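To make the fragility concrete, here is a minimal sketch of the regex-only stage described above. The keyword subset is hypothetical (the full 38-pattern list isn't reproduced in the post); the point is that a surface-matching classifier like this can disagree systematically with an LLM judge reading the same trace:

```python
import re

# Hypothetical subset of acknowledgment keywords; the study used 38.
ACK_KEYWORDS = ["hint", "told", "suggested", "the user said"]
ACK_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in ACK_KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def regex_acknowledges(trace: str) -> bool:
    """True if the chain-of-thought surface-matches any acknowledgment
    keyword. Deliberately crude: it fires on mentions that aren't
    epistemic dependence, and misses paraphrased acknowledgments."""
    return bool(ACK_PATTERN.search(trace))

print(regex_acknowledges("The hint points to B, so I'll go with B."))  # True
print(regex_acknowledges("Option B follows from the premises."))      # False
```

A false positive ("I'll ignore the hint") and a false negative ("the metadata implies B") both shift a model's faithfulness score, which is one mechanism by which rankings can flip between classifiers.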