NEDOQwen 0.8B Base Pretrained
NEDOQwen 0.8B Base Pretrained is a Turkish-focused custom decoder-only causal language model trained for the NEDO Turkish SLM project.
This repository contains a custom PyTorch checkpoint. It is not yet a Hugging Face Transformers-native model.
Summary
- Parameters: 824,256,000
- Architecture: Qwen/Llama-style decoder-only causal LM
- Language: Turkish
- Tokenizer: NEDO Turkish Tokenizer, 65K typed_surface vocabulary
- Context length: 1024 tokens
- Training dataset: Ethosoft/nedo-turkish-65k-tokenized-60b
- Checkpoint file: checkpoint.pt
- Final pretraining step: 5000
- Training tokens seen in this run: 1,310,720,000
- Validation loss: 1.5961
- Perplexity: 4.93
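The step count and token count above imply a per-step batch size. The following is a small arithmetic sketch rather than a value read from the run configuration; it assumes every training sequence fills the full 1024-token context.

```python
# Figures copied from the summary above.
TOKENS_SEEN = 1_310_720_000
FINAL_STEP = 5_000
CONTEXT_LENGTH = 1_024

# Tokens consumed per optimizer step; the division is exact.
tokens_per_step, remainder = divmod(TOKENS_SEEN, FINAL_STEP)
assert remainder == 0

# Assumption: each sequence is exactly CONTEXT_LENGTH tokens long.
sequences_per_step = tokens_per_step // CONTEXT_LENGTH

print(tokens_per_step)     # 262144
print(sequences_per_step)  # 256
```

Under that assumption the run used a global batch of 256 sequences per step. How this was split across GPUs and gradient-accumulation steps is not stated here.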
Architecture
Configuration:
vocab_size: 65536
dim: 1536
n_layers: 24
n_heads: 16
n_kv_heads: 8
mlp_dim: 4096
block_size: 1024
tie_embeddings: false
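As a sanity check, the configuration above can be turned into a parameter count. The sketch below assumes a conventional layout: three bias-free SwiGLU projections (gate, up, down), two RMSNorm weight vectors per block, and one final RMSNorm. The function name `count_parameters` is illustrative and not part of the released scripts. Under these assumptions the result matches the stated 824,256,000.

```python
def count_parameters(vocab_size, dim, n_layers, n_heads, n_kv_heads,
                     mlp_dim, tie_embeddings):
    """Count weights for a bias-free Qwen/Llama-style decoder (assumed layout)."""
    head_dim = dim // n_heads
    kv_dim = n_kv_heads * head_dim           # width of the K and V projections

    embedding = vocab_size * dim
    lm_head = 0 if tie_embeddings else vocab_size * dim

    attention = dim * dim + 2 * dim * kv_dim + dim * dim   # Q, K, V, output
    mlp = 3 * dim * mlp_dim                                 # gate, up, down
    block_norms = 2 * dim                                   # pre-attn, pre-MLP
    per_layer = attention + mlp + block_norms

    final_norm = dim
    return embedding + lm_head + n_layers * per_layer + final_norm


total = count_parameters(
    vocab_size=65536, dim=1536, n_layers=24, n_heads=16,
    n_kv_heads=8, mlp_dim=4096, tie_embeddings=False,
)
print(f"{total:,}")  # 824,256,000
```

Because `tie_embeddings` is false, the input embedding and the output head together account for 201,326,592 of these weights, roughly a quarter of the model.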
Model components:
- decoder-only causal language model
- RMSNorm
- RoPE positional embeddings
- grouped-query attention
- SwiGLU MLP
- pre-norm transformer blocks
- bias-free linear layers
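To make the grouped-query attention setting concrete, the sketch below derives the head layout from the configuration (16 query heads, 8 key/value heads). The mapping of consecutive query heads to one shared KV head is a common convention and is assumed here, not confirmed from the model code. The cache-size estimate is hypothetical as well: it assumes 2-byte values (such as bf16) and describes what a KV cache would need, not something this repository implements.

```python
DIM, N_LAYERS, N_HEADS, N_KV_HEADS, BLOCK_SIZE = 1536, 24, 16, 8, 1024

head_dim = DIM // N_HEADS             # 96 dimensions per head
group_size = N_HEADS // N_KV_HEADS    # 2 query heads share each KV head

# Assumed convention: query heads 0-1 use KV head 0, heads 2-3 use KV head 1, ...
kv_head_for_query = [q // group_size for q in range(N_HEADS)]


def kv_cache_bytes(n_kv_heads, bytes_per_value=2):
    """Bytes needed to cache keys and values for one full-context sequence."""
    values_per_token = 2 * n_kv_heads * head_dim * N_LAYERS   # K and V
    return values_per_token * BLOCK_SIZE * bytes_per_value


print(kv_head_for_query)                    # [0, 0, 1, 1, 2, 2, ..., 7, 7]
print(kv_cache_bytes(N_KV_HEADS) / 2**20)   # 72.0  (MiB with 8 KV heads)
print(kv_cache_bytes(N_HEADS) / 2**20)      # 144.0 (MiB if all 16 heads had own K/V)
```

With these numbers, grouped-query attention halves the per-sequence cache relative to full multi-head attention.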
Important compatibility note
This is not yet a standard Hugging Face Transformers checkpoint.
The following will not work yet:
AutoModelForCausalLM.from_pretrained(...)
Use the included custom scripts for loading and generation: scripts/20_train_qwen_style.py defines the model and scripts/30_sample_qwen_style.py handles sampling.
Files
- checkpoint.pt: custom PyTorch checkpoint
- config.json: model architecture configuration
- scripts/20_train_qwen_style.py: model definition and pretraining script
- scripts/30_sample_qwen_style.py: sampling script
- tokenizer/vocab_65536.jsonl: NEDO Turkish tokenizer vocabulary
- metadata/model_info.json: checkpoint metadata
- metadata/run_config.json: training run configuration, if available
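A minimal sketch of reading config.json with the standard library follows. It assumes the file uses the same key names as the configuration listed in the Architecture section; the `ModelConfig` class and `load_config` helper are hypothetical and are not part of the released scripts.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ModelConfig:
    # Assumed key names, taken from the configuration listing in this card.
    vocab_size: int
    dim: int
    n_layers: int
    n_heads: int
    n_kv_heads: int
    mlp_dim: int
    block_size: int
    tie_embeddings: bool


def load_config(path):
    """Load config.json and run basic shape checks (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        raw = json.load(handle)

    names = [field.name for field in fields(ModelConfig)]
    missing = [name for name in names if name not in raw]
    if missing:
        raise KeyError(f"config is missing keys: {missing}")

    # Extra keys in the file are ignored.
    cfg = ModelConfig(**{name: raw[name] for name in names})

    if cfg.dim % cfg.n_heads != 0:
        raise ValueError("dim must be divisible by n_heads")
    if cfg.n_heads % cfg.n_kv_heads != 0:
        raise ValueError("n_heads must be divisible by n_kv_heads")
    return cfg
```

Calling `load_config("config.json")` from the repository root would then return a typed configuration object, provided the key names match this assumption.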
Training data
This model was trained on:
Ethosoft/nedo-turkish-65k-tokenized-60b
That dataset is a tokenized Turkish web-corpus snapshot containing approximately 60.95B uint16 tokens.
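The figures above allow a quick back-of-the-envelope check: why uint16 is sufficient for this vocabulary, roughly how large the token data is, and what share of it this run consumed. The sketch uses the approximate 60.95B token count, so the results are estimates, and it ignores any file-format overhead.

```python
VOCAB_SIZE = 65_536
DATASET_TOKENS = 60.95e9        # approximate size stated above
TOKENS_SEEN = 1_310_720_000     # tokens seen in this pretraining run

# Token ids range over 0..65535, which is exactly the uint16 range.
UINT16_MAX = 2**16 - 1
assert VOCAB_SIZE - 1 <= UINT16_MAX

dataset_gb = DATASET_TOKENS * 2 / 1e9          # 2 bytes per uint16 token
fraction_seen = TOKENS_SEEN / DATASET_TOKENS   # share of the corpus consumed

print(f"{dataset_gb:.1f} GB")   # 121.9 GB
print(f"{fraction_seen:.2%}")   # 2.15%
```

In other words, this checkpoint has seen only about 2% of the available corpus, which is consistent with its recommended use as a starting point for continued pretraining.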
Training notes
This checkpoint corresponds to the stable 4xH200 base-pretraining run.
Recorded final evaluation:
- step: 5000
- val_loss: 1.5961
- ppl: 4.93
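The two metrics are related: perplexity is the exponential of the mean per-token cross-entropy. The sketch below assumes the loss is measured in nats (natural logarithm), which is the PyTorch default and is consistent with the reported numbers.

```python
import math

val_loss = 1.5961                        # reported validation loss, assumed nats

perplexity = math.exp(val_loss)          # ppl = e^loss
bits_per_token = val_loss / math.log(2)  # same loss expressed in bits

print(round(perplexity, 2))      # 4.93
print(round(bits_per_token, 2))  # 2.3
```

Note that perplexity is measured per token of the NEDO tokenizer, so it is not directly comparable with models that use a different vocabulary.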
A later 8xH200 continued-pretraining experiment exists in the project history, but the best logged evaluation checkpoint from that run was not saved to disk. This repository therefore releases the stable, saved base-pretraining checkpoint.
Recommended use
This checkpoint is best used for:
- continued Turkish pretraining
- supervised fine-tuning experiments
- Turkish small language model research
- tokenizer/model ablation studies
- reproducibility of the NEDO Turkish SLM pipeline
Not intended as
- a production assistant model
- a safety-aligned chatbot
- a Transformers-native checkpoint
- a fully benchmarked general-purpose model
Known limitations
- Not HF Transformers-compatible yet
- No production KV-cache inference wrapper
- Instruction-following is weak in the base model
- Output can become repetitive under open-ended prompting
- Systematic benchmark evaluation is still incomplete
Citation and attribution
If you use this model, please attribute:
- NEDO Turkish SLM project
- NEDO Turkish 65K Tokenized Web Corpus
- NEDO Turkish Tokenizer
- FineWeb / FineWeb-style upstream data sources where applicable
Suggested attribution:
NEDOQwen 0.8B Base Pretrained.
Turkish decoder-only SLM trained with the NEDO Turkish 65K tokenizer
on the NEDO Turkish 65K Tokenized Web Corpus.
Related repositories
- Pretraining dataset: Ethosoft/nedo-turkish-65k-tokenized-60b
- SFT datasets: Ethosoft/nedo-turkish-sft-mixtures