Automatic Restoration of Diacritics for Speech Data Sets

This is a transformer-based model for Arabic text diacritization as described here.

Evaluation Results

Evaluation on ClArTTS

DER (Diacritic Error Rate)

Configuration	With case ending	Without case ending
Including no diacritic	10.33%	8.45%
Excluding no diacritic	12.72%	10.33%

WER (Word Error Rate)

Configuration	With case ending	Without case ending
Including no diacritic	30.16%	19.71%
Excluding no diacritic	29.91%	19.60%

How to Use

Installation

git clone https://github.com/rufaelfekadu/diac.git
cd diac
pip install -e .

Loading the Model

from diac.models import DiacritizationModule

model = DiacritizationModule.from_pretrained(
    "rufaelfekadu/diac-transformer-text-only-tashkeela",
    tokenizer_constants_path="constants/"  # Path to constants directory
)

Running Inference

# Predict diacritization for a text file
model.predict_file(
    input_file="path/to/input.txt",
    output_file="path/to/output.txt"
)

# Or predict for a single text string
diacritized_text = model.predict_text("مرحبا بك")
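
When the input is already in memory, the single-string call above can be applied line by line. The following is a minimal sketch, assuming only that predict_text takes one string and returns one string as shown above. The helper name diacritize_lines is hypothetical and not part of the diac package.

```python
from typing import Callable, Iterable, List


def diacritize_lines(predict: Callable[[str], str], lines: Iterable[str]) -> List[str]:
    """Apply `predict` to each non-empty line and keep blank lines as empty strings.

    `predict` is any callable mapping one string to one string,
    for example `model.predict_text` (assumed, see the lead-in).
    """
    results = []
    for line in lines:
        text = line.strip()  # drop surrounding whitespace and the trailing newline
        results.append(predict(text) if text else "")
    return results
```

With a loaded model, this could be called as diacritize_lines(model.predict_text, open("path/to/input.txt", encoding="utf-8")).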

Running Evaluation

To evaluate the model on your own test set:

Run inference to generate predictions:

python inference.py \
    --config configs/<model>.yml \
    --opts \
    DATA.TEST_PATH path/to/test.txt \
    INFERENCE.MODEL_PATH <path_to_checkpoint> \
    INFERENCE.OUTPUT_PATH path/to/predictions.txt

Prepare the reference file (if needed):

python src/diac/utils/prep_ref.py \
    --input_file path/to/test.txt \
    -o path/to/output_dir
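
Before calculating metrics, it can help to check that the predictions and the reference line up. The sketch below assumes that the two files are compared line by line and that each pair of lines should have identical text once diacritics are removed. Neither assumption is confirmed by the repository, and the helper names are hypothetical.

```python
import re

# Arabic diacritic marks: tanween, short vowels, shadda, sukun (U+064B to U+0652).
DIACRITICS_RE = re.compile("[\u064b-\u0652]")


def strip_diacritics(text):
    """Remove Arabic diacritic marks, leaving the base letters untouched."""
    return DIACRITICS_RE.sub("", text)


def find_misaligned_lines(pred_lines, ref_lines):
    """Return 1-based numbers of lines whose undiacritized text differs."""
    if len(pred_lines) != len(ref_lines):
        raise ValueError(
            f"line count mismatch: {len(pred_lines)} predictions, "
            f"{len(ref_lines)} references"
        )
    return [
        number
        for number, (pred, ref) in enumerate(zip(pred_lines, ref_lines), start=1)
        if strip_diacritics(pred).strip() != strip_diacritics(ref).strip()
    ]
```

An empty result suggests the files are aligned. Any reported line numbers point to pairs where the base text itself differs, which would distort the metrics.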

Calculate metrics (DER, WER, SER):

python src/diac/utils/eval.py \
    -ofp path/to/predictions.txt \
    -tfp path/to/reference.txt \
    --style Fadel

The evaluation script reports DER, WER, and SER for each combination of the following settings:

With/without case ending
Including/excluding no diacritic
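
To illustrate what these settings mean, here is a simplified sketch of DER and WER. It is not the code in eval.py. It assumes that "without case ending" ignores the last letter of each word, that "excluding no diacritic" skips letters that carry no diacritic in the reference, and that prediction and reference share the same base letters. Unlike a full implementation, it counts every non-diacritic character rather than Arabic letters only.

```python
# Arabic diacritic marks: tanween, short vowels, shadda, sukun.
DIACRITICS = set("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")


def split_diacritics(word):
    """Return a list of (letter, marks) pairs; marks are sorted so order does not matter."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS:
            if pairs:  # ignore a stray mark with no preceding letter
                letter, marks = pairs[-1]
                pairs[-1] = (letter, "".join(sorted(marks + ch)))
        else:
            pairs.append((ch, ""))
    return pairs


def der_wer(reference, prediction, case_ending=True, include_no_diacritic=True):
    """Return (DER, WER) as percentages for two whitespace-tokenized strings."""
    char_total = char_errors = word_total = word_errors = 0
    for ref_word, pred_word in zip(reference.split(), prediction.split()):
        ref = split_diacritics(ref_word)
        pred = split_diacritics(pred_word)
        if not case_ending:
            # Drop the final letter, which carries the case ending.
            ref, pred = ref[:-1], pred[:-1]
        counted = False
        word_wrong = False
        for (_, ref_marks), (_, pred_marks) in zip(ref, pred):
            if not include_no_diacritic and ref_marks == "":
                continue  # skip letters that are bare in the reference
            counted = True
            char_total += 1
            if ref_marks != pred_marks:
                char_errors += 1
                word_wrong = True
        if counted:
            word_total += 1
            if word_wrong:
                word_errors += 1
    der = 100.0 * char_errors / char_total if char_total else 0.0
    wer = 100.0 * word_errors / word_total if word_total else 0.0
    return der, wer
```

For example, if the only mistake in a word is on its final letter, the word counts as an error with case ending and as correct without it. This is consistent with the WER values above being lower in the "Without case ending" column.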
