Auron-510M

Auron — Chimera hybrid GDN-Attention language models with Ouroboros weight sharing.

Paper: Auron: Depth-Efficient Language Models via Hybrid Recurrent-Attention Weight Sharing
Code: github.com/Fy-/Auron

Architecture

  • Type: Chimera (ChimeraConfig)
  • Dim: 1536
  • Layers: 16 virtual
  • Params: 510,217,280 (510M)
  • Vocab: 151936 (Qwen 3 tokenizer)
  • Context: 2048 tokens
  • Topology: 4 unique bottom + 4×3 shared top
  • GDN:Attn ratio: 3:1 (every 4th layer is attention)
  • Virtual equivalent: ~1,020,434,560 params
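The topology above can be sketched in a few lines of Python. This is an illustrative reconstruction from the numbers in the list, not the repo's code: 4 unique bottom layers plus a block of 4 shared layers replayed 3 times gives 16 virtual layers from 8 physical ones, and every 4th virtual layer is attention. All names are hypothetical.

```python
N_UNIQUE = 4    # unique bottom layers
N_SHARED = 4    # physical layers in the shared top block
N_REPEATS = 3   # times the shared block is replayed (Ouroboros sharing)

def virtual_layer_schedule():
    """Map each virtual layer position to a physical layer id."""
    schedule = [f"bottom_{i}" for i in range(N_UNIQUE)]
    for _ in range(N_REPEATS):
        schedule += [f"shared_{i}" for i in range(N_SHARED)]
    return schedule

def layer_type(virtual_idx):
    """3:1 GDN:Attention ratio -- every 4th virtual layer is attention."""
    return "attn" if (virtual_idx + 1) % 4 == 0 else "gdn"

schedule = virtual_layer_schedule()
assert len(schedule) == 16       # 16 virtual layers
assert len(set(schedule)) == 8   # from only 8 physical layers
```

This is why the "virtual equivalent" parameter count is roughly double the physical 510M: the shared top block contributes its weights to 12 of the 16 virtual layers while being stored once.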

Training Curves

Training

  • Step: 249,000
  • Data: Mixed (75% FineWeb-Edu, 18% StarCoder, 5% FineMath, 2% UltraChat)
  • Optimizer: Muon + AdamW (decoupled embedding LR)
  • Schedule: WSD (Warmup-Stable-Decay)
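A minimal sketch of a Warmup-Stable-Decay learning-rate schedule, assuming linear warmup, a constant plateau, and linear decay; the card does not state Auron's actual phase boundaries or peak LR, so all values here are placeholders.

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps):
    """Warmup-Stable-Decay schedule: ramp up, hold, ramp down."""
    if step < warmup_steps:                    # warmup: linear ramp to peak
        return peak_lr * step / warmup_steps
    if step < warmup_steps + stable_steps:     # stable: hold at peak
        return peak_lr
    done = step - warmup_steps - stable_steps  # decay: linear ramp to zero
    return peak_lr * max(0.0, 1.0 - done / decay_steps)
```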

Usage

# Install
git clone https://github.com/Fy-/Auron && cd Auron && rye sync

# Generate (Python)
from ouro import load_model, generate

model, tokenizer, device = load_model("nyxia/Auron-510M")
generate(model, tokenizer, device, "The history of")

Sampling

Default: T=0.7, top_k=20, top_p=0.9, rep_pen=1.0, presence_pen=1.5 (Ouroboros weight sharing requires presence penalty >= 1.5 to prevent attractor wells).
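An illustrative sketch (not the repo's sampler) of how a presence penalty of this kind is typically applied: every token id that has already appeared gets a flat logit deduction, which is what discourages the shared-weight model from collapsing into repetitive attractor wells.

```python
def apply_presence_penalty(logits, generated_ids, presence_pen=1.5):
    """Return logits with presence_pen subtracted for tokens seen so far."""
    seen = set(generated_ids)  # token ids already in the output
    return [lg - presence_pen if tok in seen else lg
            for tok, lg in enumerate(logits)]

logits = [2.0, 1.0, 0.5, 3.0]          # toy 4-token vocabulary
out = apply_presence_penalty(logits, generated_ids=[0, 3])
# tokens 0 and 3 were already generated: 2.0 -> 0.5 and 3.0 -> 1.5
```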

Links

Built by Florian Gasquez (@nyxia). Part of the Soulkyn project.
