
Kazakh Pythia 14M — DAPT (Soz)

Domain-adaptive pretraining (DAPT) of EleutherAI/pythia-14m on Kazakh text. This is the first Kazakh language-model experiment in the Soz project and serves as a proof of concept.

Overview

| Property | Value |
|---|---|
| Base model | EleutherAI/pythia-14m |
| Parameters | 14.1M |
| Training steps | 13,000 |
| Method | Domain-adaptive pretraining (DAPT) |
| Language | Kazakh |
| License | Apache 2.0 |
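DAPT continues the base model's causal language-modeling objective on in-domain text: it keeps minimizing next-token cross-entropy, just over Kazakh data instead of the original English-heavy pretraining corpus. A toy illustration of that loss in pure Python (the probabilities are made up for illustration; no real model is involved):

```python
import math

def causal_lm_loss(token_probs):
    """Mean negative log-likelihood of the correct next token.

    `token_probs[i]` is the model's probability for the true token at
    position i+1 given tokens 0..i. DAPT minimizes exactly this quantity,
    computed over in-domain (here, Kazakh) text.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# A model that assigns high probability to in-domain continuations has low loss;
# adapting to Kazakh text drives this number down on Kazakh inputs.
print(causal_lm_loss([0.9, 0.8, 0.95]))
```

The only difference from pretraining from scratch is the starting point: weights are initialized from the already-trained base checkpoint rather than at random.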

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The checkpoint reuses the base Pythia tokenizer, so load it from the base repo.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-14m")
model = AutoModelForCausalLM.from_pretrained("stukenov/slm-kk-pythia14m-dapt-13k")

# Sample a 50-token continuation of a Kazakh prompt.
input_ids = tokenizer("Қазақстан — ", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=50, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Note

This was the first experiment in the Soz project — a proof of concept for Kazakh language modeling. The Pythia-14m base model uses an English-centric tokenizer, making it suboptimal for Kazakh. Later models in the project use Kazakh-native tokenizers.
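One way to quantify the tokenizer mismatch described above is fertility: the average number of tokens the tokenizer produces per word. An English-centric vocabulary fragments Cyrillic Kazakh words into many small pieces, so fertility on Kazakh text is much higher than on English. A minimal sketch (the `fertility` helper is illustrative, not part of this repo; the commented lines assume the Pythia tokenizer can be downloaded):

```python
def fertility(encode, text: str) -> float:
    """Average number of tokens per whitespace-separated word.

    `encode` is any callable mapping a string to a sequence of tokens or ids,
    e.g. a Hugging Face tokenizer's `encode` method.
    """
    words = text.split()
    return len(encode(text)) / len(words)

# With the actual Pythia tokenizer (requires network access on first run):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-14m")
# kk = fertility(tok.encode, "Қазақстан тәуелсіз мемлекет")
# en = fertility(tok.encode, "Kazakhstan is an independent state")
# print(kk, en)  # fertility on the Kazakh text is typically much higher
```

High fertility shortens the effective context window and spends model capacity on stitching subword fragments back together, which is why later Soz models switch to Kazakh-native tokenizers.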

Project

Part of the Soz — Kazakh Language Models project, a research effort to build open-source language models for Kazakh.

Citation

```bibtex
@misc{tukenov2026soz,
  title={Soz: Small Language Models for Kazakh},
  author={Tukenov, Saken},
  year={2026},
  url={https://huggingface.co/stukenov/slm-kk-pythia14m-dapt-13k}
}
```

License

Apache 2.0
