Papers
arxiv:2605.08321

LLM Wardens: Mitigating Adversarial Persuasion with Third-Party Conversational Oversight

Published on May 8
Authors:
,
,
,

Abstract

Adversarial large language models can manipulate user decisions in 65.4% of cases, but a monitoring "warden" model reduces this to 30.4% by providing real-time advisory warnings.

AI-generated summary

LLMs are increasingly capable of persuasion, which raises the question of how to protect users against manipulation. In a preregistered user study (N=120) across four decision-making scenarios, we find that an adversarial LLM with a hidden goal succeeds in steering users' decisions 65.4% of the time. We then introduce a "warden" model: a secondary LLM that monitors the human-AI interaction trace in real time and issues non-binding, private advisories to the user when it detects manipulation. Adding a warden more than halves the adversary's success rate to 30.4%, with a much smaller (8.6 percentage points) reduction for genuine interactions. To probe the mechanism behind these results, we release COAX-Bench, a simulation benchmark spanning 14 decision-making scenarios, including hiring, voting, and file access. Across 16,212 simulated multi-agent interactions, capable adversarial LLMs achieve their hidden goals in 34.7% of cases, which warden models reduce to 12.3%. Notably, even warden models substantially weaker than the adversary they oversee provide meaningful protection, suggesting a path for scalable oversight of more capable models.

Community

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2605.08321 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2605.08321 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.08321 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.