NEDOQwen 0.8B Base Pretrained

NEDOQwen 0.8B Base Pretrained is a Turkish-focused custom decoder-only causal language model trained for the NEDO Turkish SLM project.

This repository contains a custom PyTorch checkpoint; it is not yet a Hugging Face Transformers-native model.

Summary

Parameters: 824,256,000
Architecture: Qwen/Llama-style decoder-only causal LM
Language: Turkish
Tokenizer: NEDO Turkish Tokenizer, 65K typed_surface vocabulary
Context length: 1024 tokens
Training dataset: Ethosoft/nedo-turkish-65k-tokenized-60b
Checkpoint file: checkpoint.pt
Final pretraining step: 5000
Training tokens seen in this run: 1,310,720,000
Validation loss: 1.5961
Perplexity: 4.93

Architecture

Configuration:

vocab_size: 65536
dim: 1536
n_layers: 24
n_heads: 16
n_kv_heads: 8
mlp_dim: 4096
block_size: 1024
tie_embeddings: false

Model components:

decoder-only causal language model
RMSNorm
RoPE positional embeddings
grouped-query attention
SwiGLU MLP
pre-norm transformer blocks
bias-free linear layers
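
As a sanity check, the reported parameter count can be reproduced from the configuration above. The sketch below assumes the usual Llama-style layout, which this card does not spell out: separate query, key, value, and output projections, three SwiGLU matrices per MLP, two RMSNorm weight vectors per block, one final RMSNorm, and no learned parameters for RoPE. The helper name `count_parameters` is illustrative and not part of the repository scripts.

```python
# Values copied from the configuration listed above.
cfg = dict(vocab_size=65536, dim=1536, n_layers=24, n_heads=16,
           n_kv_heads=8, mlp_dim=4096, tie_embeddings=False)

def count_parameters(c: dict) -> int:
    """Count weights for an assumed bias-free Llama-style decoder."""
    head_dim = c["dim"] // c["n_heads"]        # 1536 / 16 = 96
    kv_dim = c["n_kv_heads"] * head_dim        # 8 * 96 = 768 (grouped-query attention)
    # Wq and Wo are dim x dim; Wk and Wv are dim x kv_dim.
    attn = 2 * c["dim"] * c["dim"] + 2 * c["dim"] * kv_dim
    mlp = 3 * c["dim"] * c["mlp_dim"]          # SwiGLU: gate, up, down projections
    norms = 2 * c["dim"]                       # two RMSNorm weight vectors per block
    per_layer = attn + mlp + norms
    embed = c["vocab_size"] * c["dim"]         # input embedding table
    head = 0 if c["tie_embeddings"] else embed # separate output head when untied
    final_norm = c["dim"]
    return embed + head + c["n_layers"] * per_layer + final_norm

total = count_parameters(cfg)
print(f"{total:,}")  # 824,256,000
```

Under these assumptions the total matches the 824,256,000 parameters listed in the summary. About 201M of them sit in the untied input and output embedding tables.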

Important compatibility note

This is not yet a standard Hugging Face Transformers checkpoint.

The following will not work yet:

AutoModelForCausalLM.from_pretrained(...)

Load and generate with the included scripts instead: scripts/20_train_qwen_style.py (model definition) and scripts/30_sample_qwen_style.py (sampling).
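
Before wiring the checkpoint into custom code, it can help to see what the file actually contains. This is a minimal sketch, assuming checkpoint.pt was written with `torch.save`. The key layout is not documented in this card, so the helper makes no assumption about it and only outlines whatever it finds. The names `describe` and `inspect_checkpoint` are hypothetical and not part of the repository scripts.

```python
from typing import Any

def describe(obj: Any, max_depth: int = 2, _depth: int = 0) -> Any:
    """Return a compact outline of a nested checkpoint object."""
    shape = getattr(obj, "shape", None)
    if shape is not None:                  # tensor-like: report the shape only
        return f"tensor{tuple(shape)}"
    if isinstance(obj, dict):
        if _depth >= max_depth:            # stop descending, just count keys
            return f"dict({len(obj)} keys)"
        return {k: describe(v, max_depth, _depth + 1) for k, v in obj.items()}
    return type(obj).__name__

def inspect_checkpoint(path: str = "checkpoint.pt") -> Any:
    import torch  # imported lazily so describe() works without PyTorch installed
    # weights_only=False unpickles arbitrary Python objects,
    # so only use it on files you trust.
    ckpt = torch.load(path, map_location="cpu", weights_only=False)
    return describe(ckpt)
```

Calling `inspect_checkpoint()` from the repository root prints the top-level keys and tensor shapes. These can then be compared against the model definition in scripts/20_train_qwen_style.py.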

Files

checkpoint.pt: custom PyTorch checkpoint
config.json: model architecture configuration
scripts/20_train_qwen_style.py: model definition and pretraining script
scripts/30_sample_qwen_style.py: sampling script
tokenizer/vocab_65536.jsonl: NEDO Turkish tokenizer vocabulary
metadata/model_info.json: checkpoint metadata
metadata/run_config.json: training run configuration, if available

Training data

This model was trained on:

Ethosoft/nedo-turkish-65k-tokenized-60b

That dataset is a tokenized Turkish web-corpus snapshot containing approximately 60.95B tokens stored as uint16 values. This run saw 1,310,720,000 tokens, equivalent to roughly 2.2% of the snapshot.
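
The uint16 storage format follows from the vocabulary size: 65,536 IDs fit exactly in 16 bits. The sketch below estimates the raw size of the corpus and decodes a byte buffer of token IDs. It assumes a flat, headerless buffer in native byte order, which is an assumption about the file layout and not something stated in this card. `read_tokens` is a hypothetical helper.

```python
from array import array

VOCAB_SIZE = 65536
APPROX_TOKENS = 60.95e9      # approximate token count quoted for the dataset
BYTES_PER_TOKEN = 2          # uint16

# The largest token ID must fit in an unsigned 16-bit integer.
assert VOCAB_SIZE - 1 <= 2**16 - 1

approx_size_gb = APPROX_TOKENS * BYTES_PER_TOKEN / 1e9   # decimal gigabytes
print(f"~{approx_size_gb:.1f} GB of raw token data")

def read_tokens(raw: bytes) -> list[int]:
    """Decode a raw buffer as native-endian uint16 token IDs (assumed layout)."""
    ids = array("H")         # unsigned short, 2 bytes on mainstream platforms
    ids.frombytes(raw)
    return ids.tolist()
```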

Training notes

This checkpoint corresponds to the stable 4xH200 base-pretraining run.

Recorded final evaluation:

step: 5000
val_loss: 1.5961
ppl: 4.93
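
The reported numbers are mutually consistent, which the short check below illustrates. It assumes that val_loss is the mean per-token cross-entropy in nats, so that perplexity is its exponential, and that every step processed the same number of full 1024-token sequences. Neither assumption is stated explicitly in this card.

```python
import math

val_loss = 1.5961
steps = 5000
tokens_seen = 1_310_720_000
block_size = 1024

ppl = math.exp(val_loss)                       # perplexity = exp(mean cross-entropy)
tokens_per_step = tokens_seen // steps         # global batch size in tokens
sequences_per_step = tokens_per_step // block_size

print(round(ppl, 2))        # 4.93
print(tokens_per_step)      # 262144
print(sequences_per_step)   # 256
```

Under these assumptions, each optimizer step covered 256 sequences of 1024 tokens across the 4xH200 setup.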

A later 8xH200 continued-pretraining experiment exists in the project history, but the best logged evaluation checkpoint from that run was not saved as a file. Therefore this repository releases the stable saved base-pretraining checkpoint.

Recommended use

This checkpoint is best used for:

continued Turkish pretraining
supervised fine-tuning experiments
Turkish small language model research
tokenizer/model ablation studies
reproducibility of the NEDO Turkish SLM pipeline

Not intended as

a production assistant model
a safety-aligned chatbot
a Transformers-native checkpoint
a fully benchmarked general-purpose model

Known limitations

Not HF Transformers-compatible yet
No production KV-cache inference wrapper
Instruction-following is weak in the base model
Can produce repetitive output under open-ended prompting
Systematic benchmark evaluation is still incomplete

Citation and attribution

If you use this model, please attribute:

NEDO Turkish SLM project
NEDO Turkish 65K Tokenized Web Corpus
NEDO Turkish Tokenizer
FineWeb / FineWeb-style upstream data sources where applicable

Suggested attribution:

NEDOQwen 0.8B Base Pretrained.
Turkish decoder-only SLM trained with the NEDO Turkish 65K tokenizer
on the NEDO Turkish 65K Tokenized Web Corpus.

Related repositories

Pretraining dataset: Ethosoft/nedo-turkish-65k-tokenized-60b
SFT datasets: Ethosoft/nedo-turkish-sft-mixtures
