arxiv:2606.24595

MEMPROBE: Probing Long-Term Agent Memory via Hidden User-State Recovery

Published on Jun 23 · Submitted by Zhen Wang on Jun 24

Abstract

Long-term memory in LLM agents should be evaluated as an auditable post-interaction artifact by reconstructing structured user state from the agent's memory, as demonstrated by MEMPROBE, a benchmark testing memory recovery against synthetic ground truth across 50 simulated users with 31 hidden dimensions each.

Long-term memory promises LLM agents that grow more capable across sessions, maintaining an accurate, evolving understanding of the user that interaction forms. In practice, however, this memory is evaluated mostly through downstream behavior, such as later answers, personalization quality, or task success, which tests that understanding only indirectly and leaves the memory artifact itself largely unaudited. We argue that long-term memory should instead be evaluated as an auditable post-interaction artifact: after ordinary assistance, what structured user state can be reconstructed from the memory the agent leaves behind? We instantiate this view in MEMPROBE, a benchmark in which a memory-equipped agent assists simulated users, each carrying a hidden, taxonomy-anchored user-state bank, across a trajectory of leak-controlled tasks, after which that bank is reconstructed from the agent's resulting memory under both full-store and top-k access. Built on synthetic ground truth for efficient, scalable measurement, MEMPROBE spans 50 simulated users with 31 hidden dimensions each (1,550 recovery targets) and tests 5 representative memory systems. Testing state-of-the-art memory agents, we find that successful assistance and recoverable memory behave as distinct capabilities. Task completion nearly saturates, even for a memoryless baseline, while category-balanced recovery stays moderate (about 0.6) and drops further under top-k retrieval. MEMPROBE is the first benchmark to study memory recovery directly, reconstructing the user state a system retains and scoring it against ground truth. We see recovery as a concrete objective for future memory agents to optimize, and MEMPROBE as a step toward an environment where agents are trained to remember their users, growing more faithful the longer they know them.
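The abstract reports a "category-balanced recovery" score but this page does not define it. One plausible reading is a macro-average: score each category of hidden dimensions separately, then average the categories so that a category with many dimensions does not dominate. The sketch below illustrates that reading only. The exact-match comparison, the function name, and the dimension and category names are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def category_balanced_recovery(truth, recovered, category_of):
    """Macro-average of per-category exact-match accuracy.

    truth:       dict mapping dimension name -> ground-truth value
    recovered:   dict mapping dimension name -> value reconstructed from memory
    category_of: dict mapping dimension name -> category label

    Assumption: exact string match stands in for whatever scoring
    the benchmark actually uses.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for dim, true_value in truth.items():
        cat = category_of[dim]
        totals[cat] += 1
        # A dimension missing from `recovered` counts as a miss.
        if recovered.get(dim) == true_value:
            hits[cat] += 1
    per_category = {cat: hits[cat] / totals[cat] for cat in totals}
    return sum(per_category.values()) / len(per_category)


# Hypothetical toy user with three hidden dimensions in two categories.
truth = {"diet": "vegetarian", "budget": "low", "timezone": "UTC+1"}
category_of = {"diet": "preferences", "budget": "preferences", "timezone": "context"}
recovered = {"diet": "vegetarian", "budget": "high", "timezone": "UTC+1"}

score = category_balanced_recovery(truth, recovered, category_of)
print(score)  # preferences 1/2, context 1/1 -> 0.75
```

In this toy case the plain per-dimension accuracy would be 2/3, while the balanced score is 0.75, which shows how the two can diverge when categories differ in size.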

Community

MemProbe reframes long-term agent memory from “Did the assistant answer correctly?” to “What did it actually learn about the user?”, introducing a recovery-based benchmark for auditing whether agents build faithful, retrievable user models over time.

Its core technique is hidden user-state recovery: simulate users with concealed, taxonomy-grounded states, let evidence leak naturally through ordinary tasks, then reconstruct those states from the agent’s memory to audit what was written, what was retrievable, and what was lost.
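The difference between the two access modes named in the abstract, full-store and top-k, can be illustrated with a toy memory store. This is a minimal sketch under stated assumptions: the memory is a list of text entries, and retrieval ranks entries by word overlap with a query. The paper's actual retrievers and reconstruction procedure are not described on this page, and every name below is hypothetical.

```python
def _tokens(text):
    """Lowercased whitespace tokens of a string, as a set."""
    return set(text.lower().split())


def top_k(memory, query, k):
    """Return the k entries sharing the most words with the query.

    Assumption: simple lexical overlap stands in for a real retriever.
    sorted() is stable, so ties keep their original store order.
    """
    q = _tokens(query)
    ranked = sorted(memory, key=lambda entry: len(q & _tokens(entry)), reverse=True)
    return ranked[:k]


def accessible(memory, query, k=None):
    """Entries visible to the reconstructor: all of them, or only the top k."""
    return list(memory) if k is None else top_k(memory, query, k)


# Hypothetical memory left behind after a few assistance tasks.
memory = [
    "user prefers vegetarian meals",
    "user works night shifts",
    "user budget for meals is low",
]

full_view = accessible(memory, "meals budget")          # full-store access
narrow_view = accessible(memory, "meals budget", k=1)   # top-k access
print(full_view)
print(narrow_view)
```

Under top-k access the reconstructor sees only what retrieval surfaces, so a fact that was written to memory can still be unrecoverable, consistent with the abstract's observation that recovery drops under top-k retrieval.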

By making memory observable, measurable, and debuggable, MemProbe points toward a practical and scalable environment for building agents that can support truly personalized assistance.

