IPA Transcriptor 300M

Fine-tuned google/byt5-small for English word to IPA transcription.

Task format:

ipa: analytical  ->  ˌænəˈlɪtɪkəl

Training data: English phonetic and syllable count dictionary, 125,925 word–IPA pairs after cleaning.

GitHub: pymlex/ipa-transcriptor-300M

Benchmark

Fine-tuned on NVIDIA L4, run colab_l4_bf16, beam search num_beams=4.

Metric Validation Test
n_samples 6296 6297
loss 0.1515 0.1478
perplexity 1.1636 1.1592
token_accuracy 0.7849 0.7858
exact_match 0.5982 0.6111
char_accuracy 0.8948 0.8969
cer 0.1066 0.1045
bleu 59.82 61.11

Dataset length distributions

Word length

IPA length

Training loss

Loss linear

Loss log scale

Inference

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

model_id = "pymlex/ipa-transcriptor-300M"
source_prefix = "ipa: "

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def transcribe(word: str, num_beams: int = 4) -> str:
    source = f"{source_prefix}{word.strip().lower()}"
    encoded = tokenizer(source, return_tensors="pt", truncation=True, max_length=36).to(device)
    output_ids = model.generate(**encoded, max_new_tokens=56, num_beams=num_beams, early_stopping=True)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(transcribe("analytical"))
Downloads last month
56
Safetensors
Model size
0.3B params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for pymlex/ipa-transcriptor-300M

Finetuned
(242)
this model