Automatic Restoration of Diacritics for Speech Data Sets

This is a transformer-based model for Arabic text diacritization, as described here.

Evaluation Results

Evaluation on ClArTTS

DER (Diacritic Error Rate)

| Configuration | With case ending | Without case ending |
|---|---|---|
| Including no diacritic | 10.33% | 8.45% |
| Excluding no diacritic | 12.72% | 10.33% |

WER (Word Error Rate)

| Configuration | With case ending | Without case ending |
|---|---|---|
| Including no diacritic | 30.16% | 19.71% |
| Excluding no diacritic | 29.91% | 19.60% |
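
To make the four configurations in the tables above concrete, here is a minimal sketch of how DER and WER are commonly computed for diacritization. It is not the repository's `eval.py`: the names `split_word` and `error_rates` are hypothetical, only the marks U+064B to U+0652 are treated as diacritics, every non-diacritic character is counted as a letter, and "case ending" is taken to mean the diacritic on the last letter of each word. The actual script may differ in all of these details.

```python
# Arabic tashkeel marks U+064B..U+0652 (tanween, short vowels, shadda, sukun).
DIACRITICS = frozenset(chr(c) for c in range(0x064B, 0x0653))


def split_word(word):
    """Return [letter, diacritics] pairs for one word (hypothetical helper)."""
    pairs = []
    for ch in word:
        if ch not in DIACRITICS:
            pairs.append([ch, ""])
        elif pairs:  # attach the mark to the preceding letter
            pairs[-1][1] += ch
    return pairs


def error_rates(reference, prediction, case_ending=True, include_no_diacritic=True):
    """Return (DER, WER) as fractions under one of the four configurations."""
    ref_words, pred_words = reference.split(), prediction.split()
    if len(ref_words) != len(pred_words):
        raise ValueError("reference and prediction have different word counts")

    char_total = char_errors = word_errors = 0
    for ref_word, pred_word in zip(ref_words, pred_words):
        ref, pred = split_word(ref_word), split_word(pred_word)
        if [letter for letter, _ in ref] != [letter for letter, _ in pred]:
            raise ValueError("reference and prediction letters differ")
        if not case_ending:
            # Ignore the last letter of each word, where the case ending sits.
            ref, pred = ref[:-1], pred[:-1]

        word_wrong = False
        for (_, ref_marks), (_, pred_marks) in zip(ref, pred):
            if not include_no_diacritic and ref_marks == "":
                continue  # skip letters that carry no diacritic in the reference
            char_total += 1
            # Compare as sorted lists so shadda+vowel order does not matter.
            if sorted(ref_marks) != sorted(pred_marks):
                char_errors += 1
                word_wrong = True
        word_errors += word_wrong

    der = char_errors / char_total if char_total else 0.0
    wer = word_errors / len(ref_words) if ref_words else 0.0
    return der, wer


# Two words, 8 letters; the prediction gets only the final case ending wrong
# (kasra U+0650 instead of damma U+064F on the last letter).
reference = "\u0630\u064E\u0647\u064E\u0628\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064F"
prediction = "\u0630\u064E\u0647\u064E\u0628\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u0650"

print(error_rates(reference, prediction))                     # (0.125, 0.5)
print(error_rates(reference, prediction, case_ending=False))  # (0.0, 0.0)
```

In this toy example the only mistake is a case ending, so both rates drop to zero once case endings are ignored. The same effect, on a smaller scale, is why the "Without case ending" columns above are lower than the "With case ending" ones.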

How to Use

Installation

git clone https://github.com/rufaelfekadu/diac.git
cd diac
pip install -e .

Loading the Model

from diac.models import DiacritizationModule

model = DiacritizationModule.from_pretrained(
    "rufaelfekadu/diac-transformer-text-only-tashkeela",
    tokenizer_constants_path="constants/"  # Path to constants directory
)

Running Inference

# Predict diacritization for a text file
model.predict_file(
    input_file="path/to/input.txt",
    output_file="path/to/output.txt"
)

# Or predict for a single text string
diacritized_text = model.predict_text("مرحبا بك")
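
Since the model restores diacritics, its input is presumably undiacritized text. Whether `predict_text` tolerates marks that are already present is not stated here, so if your text is partially diacritized you may want to strip them first. The helper below is a standard-library sketch; `strip_diacritics` is a hypothetical name and not part of the `diac` package, and it only removes the marks U+064B to U+0652.

```python
import re

# Arabic tashkeel marks U+064B..U+0652 (tanween, short vowels, shadda, sukun).
_DIACRITICS_RE = re.compile("[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritic marks, leaving letters and spacing untouched."""
    return _DIACRITICS_RE.sub("", text)
```

The result can then be passed on as usual, for example `model.predict_text(strip_diacritics(text))`. The same helper can produce undiacritized input from a diacritized reference file when building a test set.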

Running Evaluation

To evaluate the model on your own test set:

  1. Run inference to generate predictions:
python inference.py \
    --config configs/<model>.yml \
    --opts \
    DATA.TEST_PATH path/to/test.txt \
    INFERENCE.MODEL_PATH <path_to_checkpoint> \
    INFERENCE.OUTPUT_PATH path/to/predictions.txt
  2. Prepare reference file (if needed):
python src/diac/utils/prep_ref.py \
    --input_file path/to/test.txt \
    -o path/to/output_dir
  3. Calculate metrics (DER, WER, SER):
python src/diac/utils/eval.py \
    -ofp path/to/predictions.txt \
    -tfp path/to/reference.txt \
    --style Fadel

The evaluation script will output DER, WER, and SER metrics with different configurations:

  • With/without case ending
  • Including/excluding no diacritic