NEDOQwen 0.8B Base Pretrained

NEDOQwen 0.8B Base Pretrained is a Turkish-focused custom decoder-only causal language model trained for the NEDO Turkish SLM project.

This repository contains a custom PyTorch checkpoint; it is not yet a Hugging Face Transformers-native model.

Summary

  • Parameters: 824,256,000
  • Architecture: Qwen/Llama-style decoder-only causal LM
  • Language: Turkish
  • Tokenizer: NEDO Turkish Tokenizer, 65K typed_surface vocabulary
  • Context length: 1024 tokens
  • Training dataset: Ethosoft/nedo-turkish-65k-tokenized-60b
  • Checkpoint file: checkpoint.pt
  • Final pretraining step: 5000
  • Training tokens seen in this run: 1,310,720,000
  • Validation loss: 1.5961
  • Perplexity: 4.93
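The step count and token count above imply a fixed token budget per optimizer step. The following is a small arithmetic sketch, assuming every step consumed the same number of tokens and every sequence was packed to the full 1024-token context; neither assumption is stated in the run metadata shown here.

```python
# Figures taken from the summary above.
TOKENS_SEEN = 1_310_720_000
FINAL_STEP = 5000
CONTEXT_LENGTH = 1024

# Assumption: every step processed the same number of tokens.
tokens_per_step, remainder = divmod(TOKENS_SEEN, FINAL_STEP)

# Assumption: every sequence was packed to the full context length.
sequences_per_step = tokens_per_step // CONTEXT_LENGTH

print(tokens_per_step)     # 262144
print(remainder)           # 0
print(sequences_per_step)  # 256
```

Under those assumptions the run used a global batch of 256 sequences per step. How that batch was split across GPUs and gradient-accumulation steps cannot be inferred from these numbers.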

Architecture

Configuration:

vocab_size: 65536
dim: 1536
n_layers: 24
n_heads: 16
n_kv_heads: 8
mlp_dim: 4096
block_size: 1024
tie_embeddings: false
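The configuration above can be cross-checked against the stated parameter count. The sketch below assumes a common Llama-style layout: separate query, key, value, and output projections, three MLP matrices (gate, up, down), two RMSNorm weight vectors per block plus one final norm, and no learned parameters for RoPE. The function name `count_parameters` is illustrative and is not taken from the repository's scripts.

```python
def count_parameters(vocab_size, dim, n_layers, n_heads, n_kv_heads,
                     mlp_dim, tie_embeddings):
    """Estimate the parameter count of a bias-free Llama-style decoder."""
    head_dim = dim // n_heads
    kv_dim = n_kv_heads * head_dim

    # Attention: query and output projections are dim x dim,
    # key and value projections are dim x kv_dim (grouped-query attention).
    attention = 2 * dim * dim + 2 * dim * kv_dim

    # SwiGLU MLP: gate, up, and down projections.
    mlp = 3 * dim * mlp_dim

    # Two RMSNorm weight vectors per block (pre-attention and pre-MLP).
    norms = 2 * dim

    per_layer = attention + mlp + norms

    embedding = vocab_size * dim
    lm_head = 0 if tie_embeddings else vocab_size * dim
    final_norm = dim

    return embedding + lm_head + n_layers * per_layer + final_norm


total = count_parameters(
    vocab_size=65536, dim=1536, n_layers=24, n_heads=16,
    n_kv_heads=8, mlp_dim=4096, tie_embeddings=False,
)
print(f"{total:,}")  # 824,256,000
```

The result matches the 824,256,000 parameters listed in the summary, which is consistent with the assumed layout but does not by itself confirm it. Note that the untied input and output embeddings account for 201,326,592 of those parameters, roughly a quarter of the model.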

Model components:

  • decoder-only causal language model
  • RMSNorm
  • RoPE positional embeddings
  • grouped-query attention
  • SwiGLU MLP
  • pre-norm transformer blocks
  • bias-free linear layers
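With 16 query heads and 8 key/value heads, grouped-query attention shares each key/value head between two query heads. The sketch below works out the head dimension, the head grouping, and a rough key/value cache size at full context. Two things are assumptions here: the contiguous grouping of query heads (a common convention, not confirmed for this model) and the 16-bit cache element size.

```python
# Values from the configuration above.
N_LAYERS, DIM, N_HEADS, N_KV_HEADS, BLOCK_SIZE = 24, 1536, 16, 8, 1024

head_dim = DIM // N_HEADS            # 96
group_size = N_HEADS // N_KV_HEADS   # 2 query heads per key/value head


def kv_head_for(query_head):
    """Assumed contiguous grouping: query heads 0-1 use KV head 0, 2-3 use 1, ..."""
    return query_head // group_size


# One key vector and one value vector per layer and per KV head.
kv_elements_per_token = 2 * N_LAYERS * N_KV_HEADS * head_dim  # 36864

BYTES_PER_ELEMENT = 2  # assumption: 16-bit floats in the cache
cache_bytes = kv_elements_per_token * BLOCK_SIZE * BYTES_PER_ELEMENT

print([kv_head_for(h) for h in range(6)])  # [0, 0, 1, 1, 2, 2]
print(cache_bytes / 2**20)                 # 72.0 (MiB per sequence)
```

With one key/value head per query head, the same estimate would double to 144 MiB per sequence, which is the main practical benefit of the 16:8 head ratio.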

Important compatibility note

This is not yet a standard Hugging Face Transformers checkpoint.

The following will not work yet:

AutoModelForCausalLM.from_pretrained(...)

Use the included custom model/sampling scripts for loading and generation.
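Before wiring the checkpoint into the custom model class, it can help to look at what the file actually contains. The following is a minimal sketch, not the repository's loading code: the helper names are hypothetical, the internal structure of checkpoint.pt is not documented here, and the sketch only reports top-level keys and value types. It assumes PyTorch is installed; `torch.load` with `weights_only=True` may reject a checkpoint that stores arbitrary Python objects, in which case the included scripts remain the reference for loading.

```python
def summarize_top_level(obj):
    """Map each top-level key of a loaded checkpoint to its value's type name."""
    if not isinstance(obj, dict):
        return {"<root>": type(obj).__name__}
    return {str(key): type(value).__name__ for key, value in obj.items()}


def inspect_checkpoint(path="checkpoint.pt"):
    """Load a checkpoint on CPU and describe its top-level structure."""
    import torch  # imported lazily so summarize_top_level works without PyTorch

    # weights_only=True restricts unpickling to tensors and plain containers.
    checkpoint = torch.load(path, map_location="cpu", weights_only=True)
    return summarize_top_level(checkpoint)
```

Calling `inspect_checkpoint()` from the repository root would return a dictionary such as key names mapped to `dict`, `int`, or `Tensor`, which indicates where the model weights live inside the file.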

Files

  • checkpoint.pt: custom PyTorch checkpoint
  • config.json: model architecture configuration
  • scripts/20_train_qwen_style.py: model definition and pretraining script
  • scripts/30_sample_qwen_style.py: sampling script
  • tokenizer/vocab_65536.jsonl: NEDO Turkish tokenizer vocabulary
  • metadata/model_info.json: checkpoint metadata
  • metadata/run_config.json: training run configuration, if available

Training data

This model was trained on:

Ethosoft/nedo-turkish-65k-tokenized-60b

That dataset is a tokenized Turkish web-corpus snapshot containing approximately 60.95B uint16 tokens.
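A 65,536-entry vocabulary fits exactly into unsigned 16-bit integers, which is why the corpus can be stored as uint16. The sketch below checks that fit with the standard library and derives two rough figures from the numbers in this card: the raw corpus size and the share of it covered by this run. Little-endian byte order is an assumption made for the example, and the share ignores any repeated or skipped data.

```python
import struct

VOCAB_SIZE = 65536
CORPUS_TOKENS = 60.95e9        # approximate, from the dataset description
TOKENS_SEEN = 1_310_720_000    # from the summary of this run

# The largest token id must round-trip through a 2-byte unsigned integer.
largest_id = VOCAB_SIZE - 1
encoded = struct.pack("<H", largest_id)  # "<H": little-endian uint16 (assumed order)
decoded = struct.unpack("<H", encoded)[0]

corpus_gb = CORPUS_TOKENS * len(encoded) / 1e9
percent_seen = 100 * TOKENS_SEEN / CORPUS_TOKENS

print(len(encoded), decoded)   # 2 65535
print(f"{corpus_gb:.1f} GB")   # about 121.9 GB of raw token data
print(f"{percent_seen:.2f}%")  # about 2.15% of the corpus
```

On these figures, the released checkpoint has seen only a small fraction of the available corpus, which is consistent with the recommendation to use it as a starting point for continued pretraining.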

Training notes

This checkpoint corresponds to the stable 4xH200 base-pretraining run.

Recorded final evaluation:

  • step: 5000
  • val_loss: 1.5961
  • ppl: 4.93
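The two evaluation figures are related: perplexity is the exponential of the mean per-token cross-entropy. The check below assumes val_loss is reported in nats (natural logarithm), which is the usual convention for PyTorch cross-entropy.

```python
import math

VAL_LOSS = 1.5961  # recorded validation loss, assumed to be in nats per token

perplexity = math.exp(VAL_LOSS)
bits_per_token = VAL_LOSS / math.log(2)

print(round(perplexity, 2))  # 4.93
print(f"{bits_per_token:.2f} bits per token")
```

Both values are per token under the NEDO Turkish tokenizer, so they are not directly comparable to perplexities reported for models that use a different vocabulary.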

A later 8xH200 continued-pretraining experiment exists in the project history, but the best logged evaluation checkpoint from that run was not saved to disk. This repository therefore releases the stable, saved base-pretraining checkpoint.

Recommended use

This checkpoint is best used for:

  • continued Turkish pretraining
  • supervised fine-tuning experiments
  • Turkish small language model research
  • tokenizer/model ablation studies
  • reproducibility of the NEDO Turkish SLM pipeline

Not intended as

  • a production assistant model
  • a safety-aligned chatbot
  • a Transformers-native checkpoint
  • a fully benchmarked general-purpose model

Known limitations

  • Not HF Transformers-compatible yet
  • No production KV-cache inference wrapper
  • Instruction-following is weak in the base model
  • Can produce repetitive output under open-ended prompting
  • Systematic benchmark evaluation is still incomplete

Citation and attribution

If you use this model, please attribute:

  • NEDO Turkish SLM project
  • NEDO Turkish 65K Tokenized Web Corpus
  • NEDO Turkish Tokenizer
  • FineWeb / FineWeb-style upstream data sources where applicable

Suggested attribution:

NEDOQwen 0.8B Base Pretrained.
Turkish decoder-only SLM trained with the NEDO Turkish 65K tokenizer
on the NEDO Turkish 65K Tokenized Web Corpus.

Related repositories

  • Pretraining dataset: Ethosoft/nedo-turkish-65k-tokenized-60b
  • SFT datasets: Ethosoft/nedo-turkish-sft-mixtures