# slm-kk-pythia14m-dapt-13k

Part of the **SozKZ Core** collection: base, instruct, and balanced Kazakh language models trained from scratch (Llama 50M–600M, GPT-2, and Pythia architectures).
Domain-adaptive pretraining (DAPT) of EleutherAI/pythia-14m on Kazakh text. The first Kazakh language model experiment in the Soz project — a proof of concept.
| Property | Value |
|---|---|
| Base model | EleutherAI/pythia-14m |
| Training steps | 13,000 |
| Method | Domain-adaptive pretraining |
| Language | Kazakh |
| License | Apache 2.0 |
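Domain-adaptive pretraining keeps the original causal-LM objective and simply continues optimization from the pretrained checkpoint on in-domain (here, Kazakh) text. The loop below is a minimal offline sketch of that idea, not the project's actual training code: a tiny randomly initialized GPT-2 stands in for pythia-14m so it runs without downloads, and the byte-level inputs, corpus, and hyperparameters are purely illustrative.

```python
# Sketch of domain-adaptive pretraining: the same next-token objective,
# continued on in-domain text. A tiny random GPT-2 replaces pythia-14m
# here so the example needs no network access; all settings are illustrative.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

model = GPT2LMHeadModel(GPT2Config(n_layer=1, n_head=2, n_embd=64, vocab_size=256))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Toy "domain corpus": raw UTF-8 bytes of Kazakh text, used as token ids
corpus = "Қазақстан — Орталық Азиядағы мемлекет. ".encode("utf-8")
ids = torch.tensor([list(corpus)])  # shape (1, seq_len), values in 0..255

for step in range(3):  # the real run continued for 13,000 steps
    out = model(input_ids=ids, labels=ids)  # causal-LM loss (shifted targets)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```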
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The model reuses the base Pythia tokenizer, so it is loaded from the base repo
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-14m")
model = AutoModelForCausalLM.from_pretrained("stukenov/slm-kk-pythia14m-dapt-13k")

# Kazakh prompt: "Kazakhstan — "
input_ids = tokenizer("Қазақстан — ", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=50, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
This was the first experiment in the Soz project — a proof of concept for Kazakh language modeling. The Pythia-14m base model uses an English-centric tokenizer, making it suboptimal for Kazakh. Later models in the project use Kazakh-native tokenizers.
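The tokenizer mismatch can be seen even at the byte level: Pythia's BPE merges were learned mostly from English text, and each Kazakh Cyrillic letter already costs two UTF-8 bytes, so a byte-level BPE with few Kazakh merges starts from roughly twice as many symbols per word before any merging happens. A stdlib-only illustration (a general property of byte-level BPE, not a measurement of this model):

```python
# Kazakh Cyrillic letters occupy 2 bytes each in UTF-8; ASCII letters occupy 1.
# An English-centric byte-level BPE therefore fragments Kazakh words far more.
kk = "Қазақстан"   # 9 letters, Cyrillic
en = "Kazakhstan"  # 10 letters, ASCII
print(len(kk), len(kk.encode("utf-8")))  # 9 letters -> 18 bytes
print(len(en), len(en.encode("utf-8")))  # 10 letters -> 10 bytes
```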
Part of the Soz — Kazakh Language Models project, a research effort to build open-source language models for Kazakh.
```bibtex
@misc{tukenov2026soz,
  title  = {Soz: Small Language Models for Kazakh},
  author = {Tukenov, Saken},
  year   = {2026},
  url    = {https://huggingface.co/stukenov/slm-kk-pythia14m-dapt-13k}
}
```