
kazakh-gec-50m

A Kazakh grammatical error correction (GEC) model fine-tuned from kazakh-llama-50m-v2 on synthetic GEC data (~390K training examples, 20% of which are identity pairs).

The model corrects morphological errors (vowel harmony, suffixes), word order issues, and other common grammatical mistakes in Kazakh text.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "stukenov/kazakh-gec-50m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).eval()

text = "Ол кітапті оқыды"
prompt = f"<TASK_FIX><SRC>{text}<SEP>"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

result = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(f"Input:  {text}")
print(f"Output: {result}")
# Input:  Ол кітапті оқыды
# Output: Ол кітапты оқыды

Format

The model uses a decoder-only seq2seq format with special tokens:

<TASK_FIX><SRC>{noisy text}<SEP>{corrected text}<EOS>

During training, loss is computed only on tokens after <SEP>.
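
The masking described above can be sketched as follows. This is a minimal illustration of the idea; the helper name and token IDs are invented for the example, not taken from the model's training code:

```python
IGNORE_INDEX = -100  # positions labeled -100 are ignored by PyTorch's cross-entropy loss

def mask_labels(input_ids, sep_token_id):
    """Copy input_ids into labels, masking everything up to and including <SEP>."""
    sep_pos = input_ids.index(sep_token_id)
    return [IGNORE_INDEX] * (sep_pos + 1) + input_ids[sep_pos + 1:]

# Pretend token IDs: 99 stands in for <SEP>; loss is taken only on [20, 21].
print(mask_labels([10, 11, 12, 99, 20, 21], sep_token_id=99))
# [-100, -100, -100, -100, 20, 21]
```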

Evaluation

Evaluated on the test split (200 examples) of kazakh-synthetic-gec-datasets:

| Metric | Value |
|---|---|
| Exact Match | 62.0% |
| Character Error Rate (CER) | 0.0802 |
| Word Precision | 0.494 |
| Word Recall | 0.661 |
| Word F0.5 | 0.520 |
| Identity Preservation | 100% (26/26) |
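
F0.5 weights precision twice as heavily as recall, the standard choice in GEC evaluation (an unwanted "correction" is worse than a missed one). The reported value can be reproduced from the precision and recall above:

```python
def f_beta(precision, recall, beta=0.5):
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(f_beta(0.494, 0.661), 3))
# 0.52
```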

Strengths:

  • Excellent identity preservation — never corrupts already correct text
  • Handles morphological errors well (vowel harmony, suffix agreement)
  • Good at word order corrections

Limitations:

  • Struggles with complex multi-word rearrangements
  • May hallucinate alternative words instead of making minimal corrections on rare vocabulary

Examples

| Input | Output | Fix |
|---|---|---|
| Ол кітапті оқыды | Ол кітапты оқыды | Vowel harmony (ті→ты) |
| Ол кеше базарга барды | Ол кеше базарға барды | Vowel harmony (га→ға) |
| Ол маған жазыды хат | Ол маған хат жазды | Word order + morphology |
| Мен сенің кітабыңды алдым | Мен сенің кітабыңды алдым | No change (correct input) |

Architecture

| Parameter | Value |
|---|---|
| Base model | kazakh-llama-50m-v2 (Llama) |
| Parameters | ~50M |
| Hidden size | 576 |
| Layers | 8 |
| Attention heads | 8 |
| Vocab size | 50,263 (+3 special tokens) |
| Max sequence length | 512 |
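
A back-of-envelope check (my own arithmetic, not from the repo): at this scale the embedding matrix dominates the parameter budget, which also suggests the input embeddings and LM head are tied:

```python
vocab_size, hidden_size = 50_263, 576
total_params = 50_600_000  # ~50M from the table above (50.6M per the safetensors metadata)

embedding_params = vocab_size * hidden_size
print(f"embeddings: {embedding_params:,} ({embedding_params / total_params:.0%} of total)")
# embeddings: 28,951,488 (57% of total)

# Two untied copies (input embeddings + separate LM head) would alone
# exceed the total, so the two matrices are presumably shared.
assert 2 * embedding_params > total_params
```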

Training

  • Dataset: kazakh-synthetic-gec-datasets — 10 subdirectories, ~390K train / 16K val / 16K test examples
  • Identity examples: 20% of training data (input == target) to prevent over-correction
  • Epochs: 1
  • Batch size: 8 per GPU × 4 GPUs × 4 gradient accumulation = effective 128
  • Learning rate: 2e-5, cosine schedule, 5% warmup
  • Hardware: 4× RTX 4090 (vast.ai)
  • Training time: ~55 minutes
  • Final eval loss: 0.377

Special Tokens

| Token | Purpose |
|---|---|
| <TASK_FIX> | Task prefix |
| <SRC> | Source text delimiter |
| <SEP> | Separator between input and target |
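
Assembling and parsing this format is plain string work. A minimal sketch (the helper names are mine, not part of the repo):

```python
def make_prompt(noisy: str) -> str:
    # Generation-time prompt, as in the Usage section
    return f"<TASK_FIX><SRC>{noisy}<SEP>"

def split_example(example: str) -> tuple[str, str]:
    # Recover (noisy, corrected) from a full training string
    head, corrected = example.split("<SEP>", 1)
    return head.removeprefix("<TASK_FIX>").removeprefix("<SRC>"), corrected

ex = "<TASK_FIX><SRC>Ол кітапті оқыды<SEP>Ол кітапты оқыды"
print(split_example(ex))
# ('Ол кітапті оқыды', 'Ол кітапты оқыды')
```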

License

Apache 2.0
