datause-extraction-v1

This is the official fine-tuned GLiNER2 LoRA adapter for extracting structured data mentions, their attributes, geographic coverage, and usage roles from economics and development research documents.

The model uses a joint entity and relation extraction schema to detect dataset mentions and link each one to its metadata (producers, acronyms, timeframes, and countries), while avoiding choices-based prefix collisions.


Rationale and Context: Forced Displacement, Refugees, and FCV

Why This Model Was Created

Tracking how datasets are used in Fragility, Conflict, and Violence (FCV) settings and forced displacement contexts is critical. Research on refugees, internally displaced persons (IDPs), and host communities depends heavily on diverse data sources, ranging from large-scale household surveys to localized administrative registration systems.

Knowing which datasets are used, who produces them, and how they feed into policy analysis helps international organizations (such as the World Bank and UNHCR), researchers, and funding bodies:

  1. Monitor Data Investments: Quantify the impact and academic/policy reach of dedicated data initiatives (e.g., those funded by the World Bank-UNHCR Joint Data Center on Forced Displacement).
  2. Identify Data Gaps: Discover regions or populations where FCV analyses lack primary microdata and are forced to rely solely on background or secondary estimates.
  3. Avoid Duplication: Map existing research projects to avoid redundant data collection efforts in challenging, insecure environments.

Because academic literature and policy briefs are unstructured, this tracking has historically required labor-intensive manual review. The model automates that review by identifying verbatim data mentions, their producers, and their analytical roles (primary data source, validation support, or passing background citation).

Data Sources & Domain Coverage

The training data for this model was curated from research documents, reports, and working papers published by major development institutions operating in FCV regions. Key data sources referenced in the corpus include:

  • Humanitarian Registries: UNHCR's proGRES database, registration rolls from national border/refugee agencies, and program databases.
  • Displacement Tracking Systems: IOM's Displacement Tracking Matrix (DTM) reports and the Internal Displacement Monitoring Centre (IDMC) registries.
  • Household Surveys in FCV Contexts: Living Standards Measurement Study (LSMS) surveys, Demographic and Health Surveys (DHS), Multiple Indicator Cluster Surveys (MICS), and specialized welfare monitoring surveys (e.g., SHINE, SESRE).
  • Geospatial & Spatial Databases: Climate/weather indicators, conflict event databases (e.g., ACLED), and satellite camp imagery.

Usage Option 1: Using the ai4data Library Wrapper (Recommended)

The recommended way to use this model is through the official ai4data Python wrapper, which automatically handles markdown-aware document parsing, sliding context windows, overlap resolution, and entity-relation alignment.
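To make the sliding-window and overlap-resolution steps concrete, here is a minimal sketch of the general technique. It is not the ai4data implementation: the function names `sliding_windows` and `resolve_overlaps`, the character-based window sizes, and the keep-the-highest-confidence rule are all assumptions chosen for illustration.

```python
from typing import Dict, Iterator, List, Tuple


def sliding_windows(text: str, size: int = 2000, overlap: int = 200) -> Iterator[Tuple[int, str]]:
    """Yield (start_offset, chunk) pairs covering `text` with overlapping windows."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    start = 0
    while True:
        yield start, text[start:start + size]
        if start + size >= len(text):
            break  # the current window already reaches the end of the text
        start += size - overlap


def resolve_overlaps(mentions: List[Dict]) -> List[Dict]:
    """Among mentions whose spans overlap, keep the one with the highest confidence."""
    kept: List[Dict] = []
    for m in sorted(mentions, key=lambda m: m["confidence"], reverse=True):
        # Accept a mention only if it does not intersect any span already kept.
        if all(m["end"] <= k["start"] or m["start"] >= k["end"] for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m["start"])


# Hypothetical mentions found in two overlapping windows (offsets in document coordinates).
found = [
    {"text": "Health Survey", "start": 16, "end": 29, "confidence": 0.61},
    {"text": "Demographic and Health Survey", "start": 0, "end": 29, "confidence": 0.97},
    {"text": "DHS", "start": 31, "end": 34, "confidence": 0.95},
]
for mention in resolve_overlaps(found):
    print(mention["text"])
```

The overlap between consecutive windows keeps a mention that straddles a window boundary visible in full to at least one window, at the cost of producing duplicates that then need to be resolved.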

Install the library directly from GitHub:

pip install git+https://github.com/worldbank/ai4data.git

1. Extract from Text

from ai4data.data_use import extract_from_text

text = """
We use the 2022 Demographic and Health Survey (DHS) conducted by the National Statistics Office in Uganda 
to analyze child health outcomes. We complement this with population records from the Ministry of Health.
"""

results = extract_from_text(text, include_confidence=True)

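def field_text(dataset, key):
    """Return a field's text, or "N/A" when the field is missing or empty."""
    return (dataset.get(key) or {}).get("text") or "N/A"

# Print one summary line per detected dataset mention
for dataset in results["datasets"]:
    name = field_text(dataset, "mention_name")
    acronym = field_text(dataset, "acronym")
    producer = field_text(dataset, "producer")
    geography = field_text(dataset, "geography")
    usage = field_text(dataset, "usage_context")

    print(f"Dataset: {name} ({acronym})")
    print(f"  Producer: {producer} | Geography: {geography} | Role: {usage}\n")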

2. Extract from PDF

from ai4data.data_use import extract_from_document

# Extracts from a local path or a PDF URL
pdf_url = "https://pdf.usaid.gov/pdf_docs/PA00TB5D.pdf"
results = extract_from_document(pdf_url, pages=[0, 1, 2])

for page_data in results:
    print(f"--- Page {page_data['page']} ---")
    for dataset in page_data["datasets"]:
        print(f"Found mention: {dataset['mention_name']['text']}")
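Page-level results can be rolled up into document-level counts. The sketch below assumes only the shape used in the loop above (a list of page records, each with `page` and `datasets` keys). `count_mentions` is a hypothetical helper, not part of ai4data, and `sample_pages` is made-up data standing in for real output.

```python
from collections import Counter


def count_mentions(pages):
    """Count how often each dataset name is mentioned across all pages."""
    counts = Counter()
    for page_data in pages:
        for dataset in page_data.get("datasets", []):
            name = (dataset.get("mention_name") or {}).get("text")
            if name:
                counts[name.strip()] += 1
    return counts


# Made-up stand-in for the list returned by extract_from_document
sample_pages = [
    {"page": 0, "datasets": [{"mention_name": {"text": "Demographic and Health Survey"}}]},
    {"page": 1, "datasets": [
        {"mention_name": {"text": "Demographic and Health Survey"}},
        {"mention_name": {"text": "Displacement Tracking Matrix"}},
    ]},
    {"page": 2, "datasets": []},
]

for name, n in count_mentions(sample_pages).most_common():
    print(f"{name}: {n}")
```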

Usage Option 2: Using the Raw gliner2 Library (Without the Wrapper)

If you prefer to integrate the model directly without using the wrapper library, you can use the raw gliner2 package.

1. Installation

pip install gliner2 huggingface_hub

2. Code Example

from gliner2 import GLiNER2
from huggingface_hub import snapshot_download

BASE_MODEL = "fastino/gliner2-large-v1"
ADAPTER_ID = "ai4data/datause-extraction-v1"

# 1. Load model and adapter
model = GLiNER2.from_pretrained(BASE_MODEL)
model.load_adapter(snapshot_download(ADAPTER_ID))
model.eval()

# 2. Define schema
ENTITY_DEFS = {
    "name": "The exact full name of the data source or dataset",
    "acronym": "The acronym or abbreviation if any",
    "producer": "The organization or entity that produced or published the data",
    "timeframe": "The year or time period of the data such as 2019 or 2019 to 2020",
    "datatype": "The type of data verbatim from text such as survey, report, census, program, system, or assessment",
    "geography": "The country, region, or geographic area the data covers",
    "specificity": "Whether this mention is named, descriptive, or vague",
    "usage": "Whether this is primary, supporting, or background data",
}

RELATION_DEFS = {
    "has_acronym": "The acronym of the dataset",
    "has_producer": "The producer of the dataset",
    "has_timeframe": "The timeframe of the dataset",
    "has_datatype": "The data type of the dataset",
    "has_geography": "The country or geographic coverage area of the dataset",
    "has_specificity": "Whether this dataset is named, descriptive, or vague",
    "has_usage": "Whether this dataset is primary, supporting, or background",
}

schema = model.create_schema()
schema.entities(ENTITY_DEFS)
schema.relations(RELATION_DEFS)

# 3. Add prompt prefix
LABEL_PREFIX = "specificity: named | descriptive | vague usage: primary | supporting | background |"
text = "We use the 2022 Demographic and Health Survey (DHS) conducted by the National Statistics Office in Uganda."
prefixed_text = f"{LABEL_PREFIX} {text}"

# 4. Extract
outputs = model.extract(prefixed_text, schema, threshold=0.3)
print(outputs)
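Because the label prefix is prepended to the input, any character offsets computed against `prefixed_text` are shifted relative to the original `text`. The helper below is a sketch of mapping such offsets back. It assumes spans are dicts with `start` and `end` keys, as in the wrapper output; the raw `extract` call may return a different shape, so treat `shift_span` as a hypothetical utility rather than part of either library.

```python
def shift_span(span, prefix_len):
    """Re-base a span from prefixed-text offsets onto the original text.

    Returns None when the span starts inside the prefix itself.
    """
    start = span["start"] - prefix_len
    end = span["end"] - prefix_len
    if start < 0:
        return None
    return {**span, "start": start, "end": end}


LABEL_PREFIX = "specificity: named | descriptive | vague usage: primary | supporting | background |"
text = "We use the 2022 Demographic and Health Survey (DHS) conducted by the National Statistics Office in Uganda."
prefixed_text = f"{LABEL_PREFIX} {text}"
prefix_len = len(LABEL_PREFIX) + 1  # +1 for the space that joins prefix and text

# Simulate a span located in the prefixed text, then map it back.
begin = prefixed_text.index("Uganda")
span = {"text": "Uganda", "start": begin, "end": begin + len("Uganda")}
fixed = shift_span(span, prefix_len)
print(text[fixed["start"]:fixed["end"]])
```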

Response Structure (Wrapper Output)

Each item in the returned "datasets" list from the ai4data library is structured as follows:

{
  "mention_name": {
    "text": "Demographic and Health Survey",
    "confidence": 0.9998,
    "start": 12,
    "end": 41
  },
  "specificity_tag": {
    "text": "named",
    "confidence": 0.9998,
    "start": 12,
    "end": 41
  },
  "usage_context": {
    "text": "primary",
    "confidence": 0.9998,
    "start": 12,
    "end": 41
  },
  "typology_tag": {
    "text": "survey",
    "confidence": 0.9998,
    "start": 12,
    "end": 41
  },
  "acronym": {
    "text": "DHS",
    "confidence": 0.9996,
    "start": 43,
    "end": 46
  },
  "producer": {
    "text": "National Statistics Office",
    "confidence": 0.9992,
    "start": 60,
    "end": 86
  },
  "reference_year": {
    "text": "2022",
    "confidence": 0.9998,
    "start": 7,
    "end": 11
  },
  "geography": {
    "text": "Uganda",
    "confidence": 0.9997,
    "start": 90,
    "end": 96
  }
}
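For tabular analysis it is often convenient to collapse each nested item into a flat row. The following sketch relies only on the field names shown in the structure above. `flatten_item` and the `min_confidence` cutoff are hypothetical, and the 0.42 confidence in the sample is made up to show the filter at work.

```python
FIELDS = [
    "mention_name", "specificity_tag", "usage_context", "typology_tag",
    "acronym", "producer", "reference_year", "geography",
]


def flatten_item(item, min_confidence=0.5):
    """Map each field to its text, or None if missing or below the cutoff."""
    row = {}
    for field in FIELDS:
        span = item.get(field) or {}
        text = span.get("text")
        conf = span.get("confidence")
        confident = conf is None or conf >= min_confidence
        row[field] = text if (text and confident) else None
    return row


sample_item = {
    "mention_name": {"text": "Demographic and Health Survey", "confidence": 0.9998},
    "usage_context": {"text": "primary", "confidence": 0.9998},
    "acronym": {"text": "DHS", "confidence": 0.9996},
    "producer": {"text": "National Statistics Office", "confidence": 0.42},  # made-up low score
}

print(flatten_item(sample_item))
```

Rows of this shape can be passed directly to `csv.DictWriter` or collected into a data frame.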

Attribute Fields

| Field | Type | Description |
| --- | --- | --- |
| mention_name | String / Span | Verbatim name of the dataset mentioned in the text |
| specificity_tag | Choice / Span | Precision classification: named / descriptive / vague |
| usage_context | Choice / Span | Analytical role: primary (core dataset) / supporting (context/validation) / background (passing reference) |
| is_used | Boolean / Span | Derived field: True if usage_context is primary/supporting, False if background |
| typology_tag | Choice / Span | Derived/mapped data type: survey / census / administrative / database / indicator / geospatial / microdata / report / other |
| acronym | String / Span | Abbreviation or acronym linked to the dataset |
| producer | String / Span | Organizing body or agency that published/collected the data |
| reference_year | String / Span | Year or timeframe the data represents |
| geography | String / Span | Geographic coverage or region the data represents |
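The `is_used` rule stated above is simple enough to express directly. This is a sketch of that rule only, not the wrapper's own code: `derive_is_used` is a hypothetical name, and returning None for an unrecognized or missing role is an assumption the description does not cover.

```python
USED_ROLES = {"primary", "supporting"}
KNOWN_ROLES = USED_ROLES | {"background"}


def derive_is_used(usage_context):
    """Return True for primary/supporting, False for background, None otherwise."""
    if not usage_context:
        return None
    role = usage_context.strip().lower()
    if role not in KNOWN_ROLES:
        return None  # assumption: unknown roles are left undecided
    return role in USED_ROLES


for role in ("primary", "supporting", "background"):
    print(role, derive_is_used(role))
```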