arxiv:2605.24117

SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

Published on May 22

Submitted by Donghao Zhou on May 26

The Ohio State University

Abstract

AI-generated summary: Current large language model agents struggle to form robust reusable skills from episodic experience, with raw trajectory reuse often outperforming distilled skills due to discarded contextual cues.

Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience can be distilled into reusable procedural skills. We introduce SkillEvolBench, a diagnostic benchmark for evaluating this step from experience reuse to skill formation. It contains 180 tasks across six real-world agent environments, organized into role-conditioned task families with shared latent procedures. Agents learn from acquisition tasks, update an external skill library using compacted trajectories and verifier feedback, and then face frozen deployment tasks testing context shift, adversarial shortcuts, and composition. By comparing self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces. Across ten model configurations and three agent harnesses, we find that current agents often adapt locally but rarely form robust reusable skills. Skill-based conditions can improve acquisition or replay, and individual models sometimes gain on specific deployment axes, but these gains are unstable under frozen deployment. Raw-trajectory reuse frequently outperforms distilled skills, suggesting that current abstraction procedures discard contextual and procedural cues that remain useful for future tasks. Capacity and cost analyses further show that writing more skills or larger Tier-3 resource libraries is not sufficient: additional updates can improve coverage while introducing episode-specific drift and procedural clutter. These findings position SkillEvolBench as a testbed for measuring when one-off experience becomes durable procedural knowledge rather than task-local memory.

Community

donghao-zhou

Paper submitter 2 days ago

🔥 SkillEvolBench: https://skillevolbench.github.io/

xixy

2 days ago

Nice work.

avahal

2 days ago

the most interesting wrinkle in SkillEvolBench is that, across experiments, raw trajectory reuse tends to outperform distilled skills, even when the library is meant to capture reusable procedures. a hard question: are acquisition tasks evenly sampled across the six families, or does a heavy tail of scenarios inflate the raw-trace advantage and skew the comparison? it'd be great to see a targeted ablation, like removing the verifier feedback or freezing the skill author mid-run, to isolate which component actually drives any transfer. btw, the arxivlens breakdown did a nice job unpacking the method details and helped me track how the host-side Skill Author interacts with the frozen library. i wonder if future work can push toward multilevel, composable skill hierarchies that survive deployment shifts rather than ad hoc episode-level patches.

leiyingtie

about 11 hours ago

Thanks, this is a great question. A quick clarification: the acquisition tasks are not sampled from a heavy-tailed scenario pool. The benchmark uses a fixed stratified setup. Each of the six environments has five latent skill families, and each family contributes exactly three learning tasks: T1 canonical, T2 enriched, and T3 variant. After that, the library is frozen and we evaluate on T4 context-shift, T5 adversarial, and T6 composition tasks. So every family contributes the same amount of acquisition evidence.
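As a rough illustration of the stratified layout described above, the following Python sketch enumerates the task grid. It is not the benchmark's actual code: the environment names (`env_1` to `env_6`), the family identifiers, and the dictionary fields are placeholders assumed for illustration. Only the counts and tier labels come from the description.

```python
from itertools import product

# Placeholder environment names; the real benchmark's names are not given here.
ENVIRONMENTS = [f"env_{i}" for i in range(1, 7)]
FAMILIES_PER_ENV = 5
LEARNING_TIERS = ["T1_canonical", "T2_enriched", "T3_variant"]
EVAL_TIERS = ["T4_context_shift", "T5_adversarial", "T6_composition"]


def build_task_grid():
    """Enumerate one task per (environment, family, tier) combination."""
    tasks = []
    family_ids = range(1, FAMILIES_PER_ENV + 1)
    for env, fam, tier in product(ENVIRONMENTS, family_ids, LEARNING_TIERS + EVAL_TIERS):
        tasks.append({
            "env": env,
            "family": f"{env}/family_{fam}",  # hypothetical identifier scheme
            "tier": tier,
            # T1-T3 update the library; T4-T6 run after it is frozen.
            "phase": "learning" if tier in LEARNING_TIERS else "evaluation",
        })
    return tasks


tasks = build_task_grid()
print(len(tasks))  # 6 environments * 5 families * 6 tiers
```

Under these assumptions the grid has 6 × 5 × 6 = 180 entries, which matches the 180 tasks reported in the abstract, and every family contributes exactly three learning tasks.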

Raw-RAG is constrained in the same spirit. It only retrieves learning trajectories from the same family. It does not pull in cross-family traces, and it does not use evaluation traces as memory. So an evaluation task can only see the learning trajectory chain from its own family, which is the same learning window used by the skill-based methods to build or revise their libraries.
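The same-family constraint can be sketched as a simple filter over the trajectory memory. This is only an illustration of the rule stated above, not the Raw-RAG implementation: the `Trajectory` record, its fields, and `retrieve_same_family` are hypothetical names, and any real retriever would presumably add ranking or compaction on top.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record for one stored episode."""
    family: str
    tier: str    # e.g. "T1", "T2", "T3"
    phase: str   # "learning" or "evaluation"
    text: str


def retrieve_same_family(memory, eval_family):
    """Return only learning trajectories from the evaluation task's own family."""
    hits = [
        t for t in memory
        if t.family == eval_family and t.phase == "learning"
    ]
    # Keep the learning chain in T1 -> T2 -> T3 order.
    return sorted(hits, key=lambda t: t.tier)


memory = [
    Trajectory("env_1/family_1", "T2", "learning", "enriched run"),
    Trajectory("env_1/family_1", "T1", "learning", "canonical run"),
    Trajectory("env_1/family_2", "T1", "learning", "other family"),
    Trajectory("env_1/family_1", "T4", "evaluation", "must not be reused"),
]
hits = retrieve_same_family(memory, "env_1/family_1")
```

In this toy memory, `hits` contains only the two learning trajectories from `env_1/family_1`; the cross-family trace and the evaluation trace are both excluded.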

The ablation suggestions are very helpful too. Removing verifier feedback, freezing the Skill Author after an early stage, or separating “trajectory evidence is available” from “the author is allowed to revise skills” would help isolate whether transfer comes mainly from raw episodic evidence, verifier-guided revision, or the Skill Author’s abstraction step. We’ll consider adding these breakdowns in future experiments.