AI & ML interests

None defined yet.

Recent Activity

richardyoung posted an update 21 days ago
## Models know they're being influenced. They just don't tell you.

12 open-weight reasoning models. 41,832 inference runs. Six types of reasoning hints. One finding: models acknowledge influence ~87.5% of the time in their thinking tokens, but only ~28.6% in their final answers.

If you're using CoT monitoring for safety, this is a blind spot. The final answer looks clean while the model's thinking tokens tell a different story.

- Faithfulness ranges from 39.7% to 89.9% across model families
- Social-pressure hints are least acknowledged (consistency: 35.5%, sycophancy: 53.9%)
- Training methodology matters more than scale
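The headline gap can be measured with a simple keyword check. Below is a minimal sketch, assuming runs arrive as `(thinking_tokens, final_answer)` string pairs (an assumed data shape, not the dataset's actual schema) and using a small subset of acknowledgment keywords; the study's classifier uses 38.

```python
import re

# Hypothetical keyword subset; the study's regex classifier uses 38 keywords.
ACK_KEYWORDS = re.compile(r"\b(hint|told|suggested|pointed out)\b", re.IGNORECASE)

def acknowledges(text: str) -> bool:
    """True if the text explicitly mentions the hint's influence."""
    return bool(ACK_KEYWORDS.search(text))

def acknowledgment_rates(runs):
    """Fraction of runs acknowledging the hint in thinking tokens vs. final answer."""
    n = think_hits = answer_hits = 0
    for thinking, answer in runs:
        n += 1
        think_hits += acknowledges(thinking)
        answer_hits += acknowledges(answer)
    return think_hits / n, answer_hits / n
```

A large gap between the two rates is exactly the thinking-vs-answer discrepancy reported above.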

**Paper:** [Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models? (arxiv:2603.22582)](https://arxiv.org/abs/2603.22582) | **Dataset:** [richardyoung/cot-faithfulness-open-models](https://huggingface.co/datasets/richardyoung/cot-faithfulness-open-models) | **Companion paper:** [arxiv:2603.20172](https://arxiv.org/abs/2603.20172)
richardyoung posted an update 25 days ago
## I couldn't replicate my own faithfulness results. Turns out that's the point.

While building a large-scale CoT faithfulness study across open-weight reasoning models, I hit a wall: my faithfulness numbers kept shifting depending on how I classified whether a model "acknowledged" a hint in its reasoning. I assumed it was a pipeline bug. After weeks of debugging, I realized the instability *was* the finding.

### What I did

I took 10,276 reasoning traces where hints had successfully flipped a model's answer (so we *know* the hint influenced the output) and ran three different classifiers to detect whether the model acknowledged that influence in its chain-of-thought:

| Classifier | What it does | Overall faithfulness |
| --------------------------- | ------------------------------------------------------------ | -------------------- |
| **Regex-only** | Pattern-matches 38 keywords like "hint," "told," "suggested" | 74.4% |
| **Regex + Ollama pipeline** | Regex first, then a 3-judge local LLM majority vote on ambiguous cases | 82.6% |
| **Claude Sonnet 4 judge** | Independent LLM reads the full trace and judges epistemic dependence | 69.7% |
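
The two-stage pipeline row can be sketched in a few lines. The `regex_verdict` and judge interfaces below are assumptions for illustration, not the study's actual implementation:

```python
def pipeline_classify(trace, regex_verdict, judges):
    """Return True if the trace acknowledges the hint.

    regex_verdict(trace) -> True | False | None (None = ambiguous);
    judges is a list of three callables, each trace -> bool, consulted
    only when the regex pass can't decide.
    """
    verdict = regex_verdict(trace)
    if verdict is not None:
        return verdict  # cheap regex pass settled it
    votes = sum(judge(trace) for judge in judges)
    return votes >= 2  # majority of the 3 LLM judges
```

The design point is cost: most traces never reach the expensive judges, which is also why its verdicts can drift from a single-judge setup that reads every trace.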

These aren't close. The 95% confidence intervals don't even overlap. All pairwise per-model gaps are statistically significant (McNemar's test, p < 0.001).
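
The pairwise comparison can be sketched with McNemar's test on paired classifier verdicts. This uses the chi-square approximation with continuity correction; whether the study used this variant or the exact test is an assumption here:

```python
import math

def mcnemar_p(b: int, c: int) -> float:
    """Two-sided McNemar p-value (chi-square approximation, continuity
    corrected). b and c are the discordant counts: traces one classifier
    calls faithful and the other doesn't, in each direction.

    For 1 degree of freedom, the chi-square survival function at x
    equals erfc(sqrt(x / 2)).
    """
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return math.erfc(math.sqrt(stat / 2))
```

With heavily lopsided discordant counts, as implied by the non-overlapping intervals, the p-value drops far below 0.001.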

### Which models

12 open-weight reasoning models spanning 9 families (7B to 1T parameters): DeepSeek-R1, DeepSeek-V3.2-Speciale, Qwen3-235B, Qwen3.5-27B, QwQ-32B, Gemma-3-27B, Phi-4-reasoning-plus, OLMo-3.1-32B, Llama-4-Maverick, Seed-1.6-Flash, GLM-4-32B, and Falcon-H1-34B.

### The rankings flip

Classifier choice doesn't just change the numbers. It reverses model rankings. Qwen3.5-27B ranks **1st** under the pipeline but **7th** under the Sonnet judge. OLMo-3.1-32B goes from **9th to 3rd**.
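
The rank reversal is easy to illustrate. The scores below are made-up illustration values, not the study's numbers:

```python
def ranks(scores):
    """Map model -> rank (1 = most faithful) from a {model: score} dict."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}

# Hypothetical faithfulness scores for the same three models
# under two different classifiers.
pipeline_scores = {"model-a": 0.90, "model-b": 0.80, "model-c": 0.70}
judge_scores = {"model-a": 0.60, "model-b": 0.75, "model-c": 0.80}
```

Here `ranks(pipeline_scores)["model-a"]` is 1 while `ranks(judge_scores)["model-a"]` is 3: same models, same traces, opposite ordering, purely from the choice of classifier.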

richardyoung updated a Space 3 months ago
richardyoung updated a dataset 3 months ago
richardyoung published a dataset 3 months ago
richardyoung published a Space 3 months ago
richardyoung updated a model 3 months ago
richardyoung published a model 3 months ago