SozKZ GEC: Kazakh Grammar Error Correction
A Kazakh grammatical error correction (GEC) model fine-tuned from kazakh-llama-50m-v2 on synthetic GEC data (~390K training examples, plus 20% identity examples where the input is already correct and must be returned unchanged).
The model corrects morphological errors (vowel harmony, suffixes), word order issues, and other common grammatical mistakes in Kazakh text.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "stukenov/kazakh-gec-50m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).eval()

text = "Ол кітапті оқыды"
prompt = f"<TASK_FIX><SRC>{text}<SEP>"

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

# Decode only the newly generated tokens (everything after the prompt)
result = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(f"Input: {text}")
print(f"Output: {result}")
# Input: Ол кітапті оқыды
# Output: Ол кітапты оқыды
```
The model is a decoder-only causal LM that uses a seq2seq-style prompt format built from special tokens:

```
<TASK_FIX><SRC>{noisy text}<SEP>{corrected text}<EOS>
```
During training, the loss is computed only on tokens after `<SEP>`.
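Concretely, a training example can be assembled by masking the prompt portion with -100, the label value that Hugging Face cross-entropy losses ignore. This is a minimal sketch: `build_example` is a hypothetical helper, and it assumes the tokenizer already knows the special tokens above.

```python
def build_example(tokenizer, noisy: str, corrected: str) -> dict:
    """Build input_ids/labels for one GEC training example.

    Loss is taken only on the target tokens (after <SEP>); the prompt
    tokens are masked with -100 so they do not contribute to the loss.
    """
    prompt = f"<TASK_FIX><SRC>{noisy}<SEP>"
    target = f"{corrected}{tokenizer.eos_token}"
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]
    return {
        "input_ids": prompt_ids + target_ids,
        "labels": [-100] * len(prompt_ids) + target_ids,
    }
```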
Evaluated on the test split (200 examples) of kazakh-synthetic-gec-datasets:
| Metric | Value |
|---|---|
| Exact Match | 62.0% |
| Character Error Rate (CER) | 0.0802 |
| Word Precision | 0.494 |
| Word Recall | 0.661 |
| Word F0.5 | 0.520 |
| Identity Preservation | 100% (26/26) |
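CER here is presumably character-level Levenshtein distance divided by reference length; the exact normalization used in evaluation is not stated in this card. A minimal sketch of that metric:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: char-level edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))  # edit distances for the empty reference prefix
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,      # deletion
                dp[j - 1] + 1,  # insertion
                prev + (reference[i - 1] != hypothesis[j - 1]),  # substitution
            )
            prev = cur
    return dp[n] / max(m, 1)
```

For example, `cer("Ол кітапты оқыды", "Ол кітапті оқыды")` is 1/16, a single-character substitution.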
Example corrections (the last row shows a correct input preserved unchanged):
| Input | Output | Fix |
|---|---|---|
| Ол кітапті оқыды | Ол кітапты оқыды | Vowel harmony (ті→ты) |
| Ол кеше базарга барды | Ол кеше базарға барды | Vowel harmony (га→ға) |
| Ол маған жазыды хат | Ол маған хат жазды | Word order + morphology |
| Мен сенің кітабыңды алдым | Мен сенің кітабыңды алдым | No change (correct input) |
| Parameter | Value |
|---|---|
| Base model | kazakh-llama-50m-v2 (Llama) |
| Parameters | ~50M |
| Hidden size | 576 |
| Layers | 8 |
| Attention heads | 8 |
| Vocab size | 50,263 (+3 special tokens) |
| Max sequence length | 512 |
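The table above can be expressed as a `LlamaConfig`; fields the card does not list (e.g. `intermediate_size`, RoPE settings) are left at library defaults here and are assumptions, so this sketch is illustrative rather than the exact training config.

```python
from transformers import LlamaConfig

# Config matching the architecture table; unlisted fields use defaults.
config = LlamaConfig(
    vocab_size=50_263,            # includes the 3 added special tokens
    hidden_size=576,
    num_hidden_layers=8,
    num_attention_heads=8,
    max_position_embeddings=512,
)
```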
| Token | Purpose |
|---|---|
| `<TASK_FIX>` | Task prefix |
| `<SRC>` | Source text delimiter |
| `<SEP>` | Separator between input and target |
License: Apache 2.0