Abstract
IntentGrasp is a benchmark for evaluating the intent understanding capability of large language models; 20 evaluated models perform poorly on it, and Intentional Fine-Tuning yields significant improvements.
Accurately understanding the intent behind speech, conversation, and writing is crucial to the development of helpful Large Language Model (LLM) assistants. This paper introduces IntentGrasp, a comprehensive benchmark for evaluating the intent understanding capability of LLMs. Derived from 49 high-quality, open-licensed corpora spanning 12 diverse domains, IntentGrasp is constructed through source dataset curation, intent label contextualization, and task format unification. IntentGrasp contains a large-scale training set of 262,759 instances and two evaluation sets: an All Set of 12,909 test cases and a more balanced and challenging Gem Set of 470 cases. Extensive evaluations on 20 LLMs across 7 families (including frontier models such as GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.7) demonstrate unsatisfactory performance, with scores below 60% on All Set and below 25% on Gem Set. Notably, 17 out of 20 tested models perform worse than a random-guess baseline (15.2%) on Gem Set, while the estimated human performance is ~81.1%, showing substantial room for improvement. To enhance such ability, this paper proposes Intentional Fine-Tuning (IFT), which fine-tunes the models on the training set in IntentGrasp, yielding significant gains of 30+ F1 points on All Set and 20+ points on Gem Set. Moreover, the leave-one-domain-out (Lodo) experiments demonstrate the strong cross-domain generalizability of IFT, verifying that it is a promising approach to substantially enhancing the intent understanding of LLMs. Overall, by benchmarking and boosting intent understanding ability, this study sheds light on a promising path towards more intentional, capable, and safe AI assistants for human benefit and social good.
Community
- Paper: https://arxiv.org/abs/2605.06832
- Authors: Yuwei Yin, Chuyuan Li, Giuseppe Carenini
- Institute: UBC NLP Group, Department of Computer Science, University of British Columbia
- Keywords: Intent Understanding, Dataset, Benchmark, LLM, Evaluation, Intentional Fine-Tuning
- GitHub: https://github.com/YuweiYin/IntentGrasp
- Dataset: https://huggingface.co/datasets/yuweiyin/IntentGrasp
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios (2026)
- MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos (2026)
- OMIND: Framework for Knowledge Grounded Finetuning and Multi-Turn Dialogue Benchmark for Mental Health LLMs (2026)
- JAMMEval: A Refined Collection of Japanese Benchmarks for Reliable VLM Evaluation (2026)
- Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models (2026)
- CLaC at SemEval-2026 Task 6: Response Clarity Detection in Political Discourse (2026)
- Sell More, Play Less: Benchmarking LLM Realistic Selling Skill (2026)