datause-extraction-v1
This is the official fine-tuned GLiNER2 LoRA adapter for extracting structured data mentions, their attributes, geographic coverage, and usage roles from economics and development research documents.
The model leverages a joint entity + relation extraction schema to detect mentions and link them to metadata (producers, acronyms, timeframes, and countries) without suffering from choices-based prefix collision.
Rationale and Context: Forced Displacement, Refugees, and FCV
Why This Model Was Created
Tracking and monitoring the use of datasets in Fragility, Conflict, and Violence (FCV) settings and forced displacement contexts is critical. Research on refugees, internally displaced persons (IDPs), and host communities is highly dependent on diverse data sources—ranging from large-scale household surveys to localized administrative registration systems.
Understanding which datasets are being utilized, who is producing them, and how they are integrated into policy analysis helps international organizations (such as the World Bank and UNHCR), researchers, and funding bodies:
- Monitor Data Investments: Quantify the impact and academic/policy reach of dedicated data initiatives (e.g., those funded by the World Bank-UNHCR Joint Data Center on Forced Displacement).
- Identify Data Gaps: Discover regions or populations where FCV analyses lack primary microdata and are forced to rely solely on background or secondary estimates.
- Avoid Duplication: Map existing research projects to avoid redundant data collection efforts in challenging, insecure environments.
Due to the unstructured nature of academic literature and policy briefs, this has historically required labor-intensive manual reviews. This model automates this pipeline by identifying verbatim data mentions, their creators, and their exact analytical roles (primary data source, validation support, or passing background citation).
Data Sources & Domain Coverage
The training data for this model was curated using actual research documents, reports, and working papers from major development institutions operating in FCV regions. Key data sources referenced in the corpus include:
- Humanitarian Registries: UNHCR's proGRES database, registration rolls from national border/refugee agencies, and program databases.
- Displacement Tracking Systems: IOM's Displacement Tracking Matrix (DTM) reports and the Internal Displacement Monitoring Centre (IDMC) registries.
- Household Surveys in FCV Contexts: Living Standards Measurement Study (LSMS) surveys, Demographic and Health Surveys (DHS), Multiple Indicator Cluster Surveys (MICS), and specialized welfare monitoring surveys (e.g., SHINE, SESRE).
- Geospatial & Spatial Databases: Climate/weather indicators, conflict event databases (e.g., ACLED), and satellite camp imagery.
Usage Option 1: Using the ai4data Library Wrapper (Recommended)
We recommend using the official ai4data Python wrapper to interact with this model. The library automatically handles markdown-aware document parsing, sliding context windows, overlap resolution, and entity-relation alignment.
Install the library directly from GitHub:
pip install git+https://github.com/worldbank/ai4data.git
1. Extract from Text
from ai4data.data_use import extract_from_text
text = """
We use the 2022 Demographic and Health Survey (DHS) conducted by the National Statistics Office in Uganda
to analyze child health outcomes. We complement this with population records from the Ministry of Health.
"""
results = extract_from_text(text, include_confidence=True)
for dataset in results["datasets"]:
    name = dataset["mention_name"]["text"]
    acronym = dataset["acronym"]["text"] or "N/A"
    producer = dataset["producer"]["text"] or "N/A"
    geography = dataset["geography"]["text"] or "N/A"
    usage = dataset["usage_context"]["text"]
    print(f"Dataset: {name} ({acronym})")
    print(f" Producer: {producer} | Geography: {geography} | Role: {usage}\n")
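The loop above assumes every attribute is present and is a dict with a "text" key. How the wrapper represents an attribute it did not find (None, an empty dict, or a missing key) is not stated here, so a defensive accessor may be useful. A minimal sketch; `field_text` is a hypothetical helper and not part of ai4data, and the sample record is hand-written rather than real model output:

```python
def field_text(dataset, key, default="N/A"):
    """Return dataset[key]["text"], or `default` if the attribute is missing or empty."""
    span = dataset.get(key)
    if not isinstance(span, dict):
        # Covers both a missing key and an explicit None value.
        return default
    return span.get("text") or default


# Hand-written sample shaped like one item of results["datasets"] (assumption).
sample = {
    "mention_name": {"text": "Demographic and Health Survey"},
    "acronym": {"text": "DHS"},
    "producer": None,
}

print(field_text(sample, "mention_name"))
print(field_text(sample, "producer"))
print(field_text(sample, "geography"))
```

Using one accessor for every attribute keeps the printing loop free of repeated `or "N/A"` fallbacks.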
2. Extract from PDF
from ai4data.data_use import extract_from_document
# Extracts from a local path or a PDF URL
pdf_url = "https://pdf.usaid.gov/pdf_docs/PA00TB5D.pdf"
results = extract_from_document(pdf_url, pages=[0, 1, 2])
for page_data in results:
    print(f"--- Page {page_data['page']} ---")
    for dataset in page_data["datasets"]:
        print(f"Found mention: {dataset['mention_name']['text']}")
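Per-page results are often more useful once aggregated across the document. The sketch below counts how often each dataset name is mentioned. It assumes the per-page shape used in the loop above (a "page" key and a "datasets" list); `count_mentions` is a hypothetical helper, and the sample pages are hand-written, not real model output:

```python
from collections import Counter


def count_mentions(pages):
    """Count dataset mentions by verbatim name across all pages."""
    counts = Counter()
    for page_data in pages:
        for dataset in page_data.get("datasets", []):
            name = dataset["mention_name"]["text"].strip()
            counts[name] += 1
    return counts


# Hand-written sample mimicking the assumed per-page output shape.
sample_pages = [
    {"page": 0, "datasets": [{"mention_name": {"text": "Demographic and Health Survey"}}]},
    {"page": 1, "datasets": [
        {"mention_name": {"text": "Demographic and Health Survey"}},
        {"mention_name": {"text": "Displacement Tracking Matrix"}},
    ]},
    {"page": 2, "datasets": []},
]

for name, n in count_mentions(sample_pages).most_common():
    print(f"{n}x {name}")
```

Counting by verbatim name treats "DHS" and "Demographic and Health Survey" as different datasets; merging them would need the acronym field or a separate normalization step.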
Usage Option 2: Using the Raw gliner2 Library (Without the Wrapper)
If you prefer to integrate the model directly without using the wrapper library, you can use the raw gliner2 package.
1. Installation
pip install gliner2 huggingface_hub
2. Code Example
from gliner2 import GLiNER2
from huggingface_hub import snapshot_download
BASE_MODEL = "fastino/gliner2-large-v1"
ADAPTER_ID = "ai4data/datause-extraction-v1"
# 1. Load model and adapter
model = GLiNER2.from_pretrained(BASE_MODEL)
model.load_adapter(snapshot_download(ADAPTER_ID))
model.eval()
# 2. Define schema
ENTITY_DEFS = {
    "name": "The exact full name of the data source or dataset",
    "acronym": "The acronym or abbreviation if any",
    "producer": "The organization or entity that produced or published the data",
    "timeframe": "The year or time period of the data such as 2019 or 2019 to 2020",
    "datatype": "The type of data verbatim from text such as survey, report, census, program, system, or assessment",
    "geography": "The country, region, or geographic area the data covers",
    "specificity": "Whether this mention is named, descriptive, or vague",
    "usage": "Whether this is primary, supporting, or background data",
}
RELATION_DEFS = {
    "has_acronym": "The acronym of the dataset",
    "has_producer": "The producer of the dataset",
    "has_timeframe": "The timeframe of the dataset",
    "has_datatype": "The data type of the dataset",
    "has_geography": "The country or geographic coverage area of the dataset",
    "has_specificity": "Whether this dataset is named, descriptive, or vague",
    "has_usage": "Whether this dataset is primary, supporting, or background",
}
schema = model.create_schema()
schema.entities(ENTITY_DEFS)
schema.relations(RELATION_DEFS)
# 3. Add prompt prefix
LABEL_PREFIX = "specificity: named | descriptive | vague usage: primary | supporting | background |"
text = "We use the 2022 Demographic and Health Survey (DHS) conducted by the National Statistics Office in Uganda."
prefixed_text = f"{LABEL_PREFIX} {text}"
# 4. Extract
outputs = model.extract(prefixed_text, schema, threshold=0.3)
print(outputs)
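Because the label prefix is prepended to the input, any character offsets reported against the prefixed string are shifted relative to the original text. The sketch below rebuilds the prefix from a dict of choices and maps offsets back. It assumes offsets are character-based and relative to `prefixed_text`, which this card does not confirm for raw gliner2 output; `build_label_prefix` and `to_original_offsets` are hypothetical helpers:

```python
CHOICES = {
    "specificity": ["named", "descriptive", "vague"],
    "usage": ["primary", "supporting", "background"],
}


def build_label_prefix(choices):
    """Render choices in the 'field: a | b | c' layout, ending with ' |'."""
    parts = [f"{field}: {' | '.join(options)}" for field, options in choices.items()]
    return " ".join(parts) + " |"


def to_original_offsets(start, end, prefix):
    """Shift offsets from the prefixed string back onto the original text."""
    shift = len(prefix) + 1  # prefix plus the single joining space
    return start - shift, end - shift


prefix = build_label_prefix(CHOICES)
text = "We use the 2022 Demographic and Health Survey (DHS)."
prefixed = f"{prefix} {text}"

# Simulate a span located in the prefixed string, then map it back.
start = prefixed.index("DHS")
orig_start, orig_end = to_original_offsets(start, start + len("DHS"), prefix)
print(text[orig_start:orig_end])
```

Generating the prefix from one dict keeps it in step with the choice labels if they are ever edited in a single place.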
Response Structure (Wrapper Output)
Each item in the returned "datasets" list from the ai4data library is structured as follows:
{
  "mention_name": {
    "text": "Demographic and Health Survey",
    "confidence": 0.9998,
    "start": 12,
    "end": 41
  },
  "specificity_tag": {
    "text": "named",
    "confidence": 0.9998,
    "start": 12,
    "end": 41
  },
  "usage_context": {
    "text": "primary",
    "confidence": 0.9998,
    "start": 12,
    "end": 41
  },
  "typology_tag": {
    "text": "survey",
    "confidence": 0.9998,
    "start": 12,
    "end": 41
  },
  "acronym": {
    "text": "DHS",
    "confidence": 0.9996,
    "start": 43,
    "end": 46
  },
  "producer": {
    "text": "National Statistics Office",
    "confidence": 0.9992,
    "start": 60,
    "end": 86
  },
  "reference_year": {
    "text": "2022",
    "confidence": 0.9998,
    "start": 7,
    "end": 11
  },
  "geography": {
    "text": "Uganda",
    "confidence": 0.9997,
    "start": 90,
    "end": 96
  }
}
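For tabular analysis it can help to collapse each nested record into a flat row. A minimal sketch, assuming the structure shown above; `flatten_record` is a hypothetical helper, and using the lowest per-field confidence as a record-level score is an illustrative choice rather than something the library prescribes:

```python
def flatten_record(record):
    """Map each field to its text and attach the lowest per-field confidence."""
    row = {}
    confidences = []
    for field, span in record.items():
        if not isinstance(span, dict):
            continue  # skip absent or non-span values
        row[field] = span.get("text")
        if "confidence" in span:
            confidences.append(span["confidence"])
    row["min_confidence"] = min(confidences) if confidences else None
    return row


# Abbreviated copy of the example record above.
record = {
    "mention_name": {"text": "Demographic and Health Survey", "confidence": 0.9998, "start": 12, "end": 41},
    "acronym": {"text": "DHS", "confidence": 0.9996, "start": 43, "end": 46},
    "producer": {"text": "National Statistics Office", "confidence": 0.9992, "start": 60, "end": 86},
}

row = flatten_record(record)
print(row)
```

A list of such rows can be passed directly to a CSV writer or a data frame constructor.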
Attribute Fields
| Field | Type | Description |
|---|---|---|
| mention_name | String / Span | Verbatim name of the dataset mentioned in the text |
| specificity_tag | Choice / Span | Precision classification: named / descriptive / vague |
| usage_context | Choice / Span | Analytical role: primary (core dataset) / supporting (context/validation) / background (passing reference) |
| is_used | Boolean / Span | Derived field: True if usage_context is primary/supporting, False if background |
| typology_tag | Choice / Span | Derived/mapped data type: survey / census / administrative / database / indicator / geospatial / microdata / report / other |
| acronym | String / Span | Abbreviation or acronym linked to the dataset |
| producer | String / Span | Organizing body or agency that published/collected the data |
| reference_year | String / Span | Year or timeframe the data represents |
| geography | String / Span | Geographic coverage or region the data represents |
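The is_used rule described in the table can be written out as a small function. This is a sketch of the stated rule only, not the library's code; `derive_is_used` is a hypothetical name, and treating unrecognized values as not used is an assumption:

```python
USED_ROLES = {"primary", "supporting"}


def derive_is_used(usage_context):
    """True for primary/supporting roles; False for background or anything else."""
    return usage_context.strip().lower() in USED_ROLES


for role in ("primary", "supporting", "background"):
    print(role, derive_is_used(role))
```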
Model tree for ai4data/datause-extraction-v1
Base model
fastino/gliner2-large-v1