# ChronoQA
ChronoQA is a passage-grounded benchmark that tests whether retrieval-augmented generation (RAG) systems can keep temporal and causal facts straight when reading long-form narratives (novels, scripts, etc.).
Instead of giving the entire book to the model, ChronoQA forces a RAG pipeline to retrieve the right snippets and reason about evolving characters and event sequences.
| Property | Value |
|---|---|
| Instances | 1,028 question–answer pairs |
| Narratives | 18 long-form stories |
| Reasoning facets | 8 (causal, character, setting, …) |
| Evidence | Exact byte offsets for each answer |
| Language | English |
| Intended use | Evaluate/train RAG systems that need chronology & causality |
| License (annotations) | CC-BY-NC-SA-4.0 |
## Dataset Description
### Motivation
Standard RAG pipelines often lose chronological order and collapse every mention of an entity into a single node. ChronoQA highlights the failures that follow. Example:
"Who was jinxing Harry's broom during his first Quidditch match?" – a system that only retrieves early chapters may wrongly answer Snape instead of Quirrell.
### Source Stories
Most of the texts come from Project Gutenberg (public domain in the US).
| ID | Title | # Q |
|---|---|---|
| 1 | A Study in Scarlet | 67 |
| 2 | The Hound of the Baskervilles | 55 |
| 3 | Harry Potter and the Chamber of Secrets | 30 |
| 4 | Harry Potter and the Sorcerer's Stone | 25 |
| 5 | Les Misérables | 72 |
| 6 | The Phantom of the Opera | 70 |
| 7 | The Sign of the Four | 62 |
| 8 | The Wonderful Wizard of Oz | 82 |
| 9 | The Adventures of Sherlock Holmes | 34 |
| 10 | Lady Susan | 88 |
| 11 | Dangerous Connections | 111 |
| 12 | The Picture of Dorian Gray | 27 |
| 13 | The Diary of a Nobody | 39 |
| 14 | The Sorrows of Young Werther | 58 |
| 15 | The Mysterious Affair at Styles | 69 |
| 16 | Pride and Prejudice | 54 |
| 17 | The Secret Garden | 61 |
| 18 | Anne of Green Gables | 24 |
### Reasoning Facets
- Causal Consistency
- Character & Behavioural Consistency
- Setting, Environment & Atmosphere
- Symbolism, Imagery & Motifs
- Thematic, Philosophical & Moral
- Narrative & Plot Structure
- Social, Cultural & Political
- Emotional & Psychological
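Each question is tagged with exactly one facet via the `category` field, so per-facet slices are straightforward. The snippet below is a minimal sketch (using the same placeholder dataset ID as the usage example further down) that simply counts questions per facet:

```python
from collections import Counter

from datasets import load_dataset

# Tally questions per reasoning facet; "your-org/chronoqa" is a placeholder ID.
ds = load_dataset("your-org/chronoqa", split="all")
for facet, count in Counter(ds["category"]).most_common():
    print(f"{facet}: {count}")
```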
## Dataset Structure
| Field | Type | Description |
|---|---|---|
| `story_id` | string | ID of the narrative |
| `question_id` | int32 | QA index within that story |
| `category` | string | One of the 8 reasoning facets |
| `query` | string | Natural-language question |
| `ground_truth` | string | Gold answer |
| `passages` | sequence of objects | Each object contains `start_sentence` (string), `end_sentence` (string), `start_byte` (int32), `end_byte` (int32), and `excerpt` (string) |
| `story_title`\* | string | Human-readable title (optional, present in processed splits) |
\*The raw JSONL released with the paper does not include `story_title`; it is added automatically in the hosted HF dataset for convenience.
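If you work from the raw JSONL rather than the hosted dataset, one option is to pin the schema explicitly when loading. This is a minimal sketch, assuming a local file named `chronoqa.jsonl` (the file name is a placeholder):

```python
from datasets import Features, Value, load_dataset

# Schema mirroring the table above; story_title is omitted because the raw
# JSONL does not include it. Declaring passages as a plain Python list keeps
# it as a list of dicts, matching the access pattern in the usage example.
features = Features({
    "story_id": Value("string"),
    "question_id": Value("int32"),
    "category": Value("string"),
    "query": Value("string"),
    "ground_truth": Value("string"),
    "passages": [{
        "start_sentence": Value("string"),
        "end_sentence": Value("string"),
        "start_byte": Value("int32"),
        "end_byte": Value("int32"),
        "excerpt": Value("string"),
    }],
})

# The json builder exposes local files under a "train" split by default.
raw = load_dataset("json", data_files="chronoqa.jsonl", features=features, split="train")
```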
There is a single `all` split (1,028 rows). Create your own train/validation/test splits if needed (e.g. by story or by reasoning facet); see the sketch after the usage example below.
## Usage Example
```python
from datasets import load_dataset

ds = load_dataset("your-org/chronoqa", split="all")
example = ds[0]

print("Question:", example["query"])
print("Answer :", example["ground_truth"])
print("Evidence:", example["passages"][0]["excerpt"][:300], "…")
```
## Citation Information
```bibtex
@article{zhang2025respecting,
  title={Respecting Temporal-Causal Consistency: Entity-Event Knowledge Graphs for Retrieval-Augmented Generation},
  author={Zhang, Ze Yu and Li, Zitao and Li, Yaliang and Ding, Bolin and Low, Bryan Kian Hsiang},
  journal={arXiv preprint arXiv:2506.05939},
  year={2025}
}
```