Gulf-S2S-lemmatizer

Model Description

The model performs lemmatization for dialectal Arabic, focusing on Gulf (GLF) Arabic. It follows a sequence-to-sequence formulation of lemmatization: rather than treating lemmas as fixed classification labels, the model generates the lemma of a word given a context window of up to two words before and two words after it.

The model is evaluated using lemma accuracy as the main metric, with an additional normalized lemma accuracy metric that accounts for orthographic and diacritic variation. The full methodology, training setup, hyperparameters, and evaluation results are described in our paper “Lemmatizing Dialectal Arabic with Sequence-to-Sequence Models”.
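To make the two metrics concrete, here is a minimal sketch of exact and normalized lemma accuracy. The normalization shown (stripping Arabic diacritics and unifying alef variants) is an assumption for illustration only; the paper defines the actual normalization used in evaluation, and it may differ. The names `normalize_lemma` and `lemma_accuracy` are hypothetical and are not part of the released code.

```python
import re

# Arabic diacritics: tanween, short vowels, shadda, sukun, dagger alef.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
# Alef with madda, hamza above, or hamza below.
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")


def normalize_lemma(lemma: str) -> str:
    """Assumed normalization: drop diacritics and unify alef variants."""
    lemma = _DIACRITICS.sub("", lemma)
    return _ALEF_VARIANTS.sub("\u0627", lemma)


def lemma_accuracy(gold, pred, normalized=False):
    """Fraction of predictions that match the gold lemmas.

    With normalized=True, both sides are passed through normalize_lemma
    before comparison, so purely orthographic differences are not penalized.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    if normalized:
        gold = [normalize_lemma(g) for g in gold]
        pred = [normalize_lemma(p) for p in pred]
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)
```

Under this sketch, a prediction that differs from the gold lemma only in diacritics counts as an error for lemma accuracy but as correct for normalized lemma accuracy.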

Standalone Usage

The model can also be used on its own, without the full lemmatization workflow in the GitHub repository (https://github.com/CAMeL-Lab/seq2seq-arabic-dialect-lemmatization). In this case, each input must wrap the target word in the special token <target> on both sides and include up to two words before and two words after it. The script below also prepends the task prefix "lemmatize: " to every input.
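As a quick illustration of that input format before the full script, the sketch below builds the input string for one word of an already tokenized sentence. The helper name `mark_target` is hypothetical; the window size of two, the `<target>` markers, and the `lemmatize: ` prefix follow the script in this section.

```python
def mark_target(words, target_pos, window_size=2):
    """Return the model input for words[target_pos] with its context window."""
    if not 0 <= target_pos < len(words):
        raise IndexError("target_pos is outside the sentence")
    # Clip the window at the sentence boundaries.
    start = max(0, target_pos - window_size)
    end = min(len(words), target_pos + window_size + 1)
    window = list(words[start:end])
    # Wrap the target word in <target> on both sides.
    window[target_pos - start] = f"<target>{words[target_pos]}<target>"
    return "lemmatize: " + " ".join(window)
```

For a word near the start or end of a sentence, the window simply contains fewer context words on that side.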

import math
import pandas as pd
import torch
from tqdm.auto import tqdm
tqdm.pandas()  # enables DataFrame.progress_apply

from transformers import T5Tokenizer, T5ForConditionalGeneration

DIALECT_MODELS = {
    "glf": "CAMeL-Lab/GLF-S2S-lemmatizer",
}

def load_model(s2s_dialect: str):
    model_name = DIALECT_MODELS[s2s_dialect]
    tokenizer = T5Tokenizer.from_pretrained(model_name, use_fast=True, legacy=False)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    tokenizer.add_special_tokens({"additional_special_tokens": ["<target>"]})
    model.resize_token_embeddings(len(tokenizer))
    return tokenizer, model

def predict(tokenizer, model, texts: list[str], device=None, batch_size: int = 16) -> list[str]:
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device).eval()
    all_preds = []
    total_batches = math.ceil(len(texts) / batch_size)
    for i in tqdm(range(0, len(texts), batch_size), total=total_batches, desc="Predicting"):
        batch = texts[i:i + batch_size]
        enc = tokenizer(
            batch,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=64
        )
        enc = {k: v.to(device) for k, v in enc.items()}
        with torch.no_grad():
            out = model.generate(
                **enc,
                max_length=50,
                num_beams=1,
                do_sample=False
            )
        all_preds.extend(tokenizer.batch_decode(out, skip_special_tokens=True))
    return all_preds

def get_context_window_fast(sentence_index, word_index, window_size=2):
    """Build the model input for one word, using the global sentence_lookup dict."""
    words, indices = sentence_lookup[sentence_index]
    target_pos = indices.index(word_index)

    # Clip the context window at the sentence boundaries.
    start_idx = max(0, target_pos - window_size)
    end_idx = min(len(words), target_pos + window_size + 1)
    context_words = words[start_idx:end_idx]  # slicing already returns a copy
    target_word_idx = target_pos - start_idx
    # Wrap the target word in <target> on both sides.
    context_words[target_word_idx] = f"<target>{context_words[target_word_idx]}<target>"

    return f"lemmatize: {' '.join(context_words)}"


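# df should contain one row per word, with sentence_index, word_index, and word columns.
# The input_text column is built below, with the target word marked using <target>.
# Example input_text: "lemmatize: أنا أبي <target>أروح<target> البيت الحين"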

# Sort df by sentence_index and word_index
df = df.sort_values(by=["sentence_index", "word_index"])

# Build a lookup dict: {sentence_index: (words_list, indices_list)}
sentence_lookup = {
    sid: (group['word'].astype(str).tolist(), group['word_index'].tolist())
    for sid, group in df.sort_values('word_index').groupby('sentence_index')
}

df['input_text'] = df.progress_apply(
    lambda row: get_context_window_fast(row['sentence_index'], row['word_index']), axis=1
)

tokenizer, model = load_model("glf")
df["predicted_lex"] = predict(tokenizer, model, df["input_text"].tolist())
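The script above assumes that a DataFrame `df` with `sentence_index`, `word_index`, and `word` columns already exists. As a minimal sketch of one way to prepare it, the hypothetical helper below turns raw sentences into one row per word. Whitespace tokenization is an assumption made here for illustration; the repository workflow may tokenize differently.

```python
def sentences_to_rows(sentences):
    """Return one dict per word, ready to be passed to pandas.DataFrame."""
    rows = []
    for sentence_index, sentence in enumerate(sentences):
        # Assumption: words are separated by whitespace.
        for word_index, word in enumerate(sentence.split()):
            rows.append({
                "sentence_index": sentence_index,
                "word_index": word_index,
                "word": word,
            })
    return rows
```

The result can be wrapped as `df = pd.DataFrame(sentences_to_rows(my_sentences))`, where `my_sentences` is your own list of strings, before the sorting step in the script.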

📖 Citation

If you use this model in your research, please cite the following paper:

@inproceedings{saeed-habash-2025-lemmatizing,
    title = {Lemmatizing Dialectal Arabic with Sequence-to-Sequence Models},
    author = {Saeed, Mostafa and Habash, Nizar},
    booktitle = {Proceedings of the Third Arabic Natural Language Processing Conference},
    year = {2025},
    address = {Suzhou, China},
    url = {https://aclanthology.org/2025.arabicnlp-main.10/}
}

Model size: 95.7M params (Safetensors, tensor type F32)