Papers
arxiv:2601.07395

MCP-ITP: An Automated Framework for Implicit Tool Poisoning in MCP

Published on Jan 12
Authors:
,
,
,

Abstract

Implicit tool poisoning attacks in LLM agents manipulate behavior through malicious metadata without invoking poisoned tools, requiring automated frameworks for detection and mitigation.

AI-generated summary

To standardize interactions between LLM-based agents and their environments, the Model Context Protocol (MCP) was proposed and has since been widely adopted. However, integrating external tools expands the attack surface, exposing agents to tool poisoning attacks. In such attacks, malicious instructions embedded in tool metadata are injected into the agent context during MCP registration phase, thereby manipulating agent behavior. Prior work primarily focuses on explicit tool poisoning or relied on manually crafted poisoned tools. In contrast, we focus on a particularly stealthy variant: implicit tool poisoning, where the poisoned tool itself remains uninvoked. Instead, the instructions embedded in the tool metadata induce the agent to invoke a legitimate but high-privilege tool to perform malicious operations. We propose MCP-ITP, the first automated and adaptive framework for implicit tool poisoning within the MCP ecosystem. MCP-ITP formulates poisoned tool generation as a black-box optimization problem and employs an iterative optimization strategy that leverages feedback from both an evaluation LLM and a detection LLM to maximize Attack Success Rate (ASR) while evading current detection mechanisms. Experimental results on the MCPTox dataset across 12 LLM agents demonstrate that MCP-ITP consistently outperforms the manually crafted baseline, achieving up to 84.2% ASR while suppressing the Malicious Tool Detection Rate (MDR) to as low as 0.3%.

Community

The implicit-tool-poisoning setup is a great example of why agent security needs to look beyond the invoked tool. If malicious metadata can steer the agent into using a different high-privilege tool, then scanning only the final tool call misses the real attack path.

We are building Armorer Guard around that runtime boundary: local Rust scanner, structured JSON verdicts, Python support, and labels for prompt injection, sensitive-data request, exfiltration, safety bypass, destructive command, and system-prompt extraction. The goal is to score tool metadata, retrieved context, and proposed tool-call payloads before the agent acts.

Demo: https://huggingface.co/spaces/armorer-labs/armorer-guard-demo
Repo: https://github.com/ArmorerLabs/Armorer-Guard

A useful benchmark extension here would be measuring whether a guard catches the poisoned metadata before it becomes an apparently legitimate downstream tool invocation.

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2601.07395
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2601.07395 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2601.07395 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.