Auron-510M

Auron — Chimera hybrid GDN-Attention language models with Ouroboros weight sharing.

Paper: Auron: Depth-Efficient Language Models via Hybrid Recurrent-Attention Weight Sharing
Code: github.com/Fy-/Auron

Architecture

  • Type: Chimera (ChimeraConfig)
  • Dim: 1536
  • Layers: 16 virtual
  • Params: 510,217,280 (510M)
  • Vocab: 151936 (Qwen 3 tokenizer)
  • Context: 2048 tokens
  • Topology: 4 unique bottom + 4×3 shared top
  • GDN:Attn ratio: 3:1 (every 4th layer is attention)
  • Virtual equivalent: ~1,020,434,560 params
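The topology above can be sketched in a few lines of Python. This is an illustrative reconstruction from the numbers in the list, not the repo's code: 4 unique bottom layers plus a block of 4 shared layers replayed 3 times gives 16 virtual layers from 8 physical ones, and every 4th virtual layer is attention. All names are hypothetical.

```python
N_UNIQUE = 4    # unique bottom layers
N_SHARED = 4    # physical layers in the shared top block
N_REPEATS = 3   # times the shared block is replayed (Ouroboros sharing)

def virtual_layer_schedule():
    """Map each virtual layer position to a physical layer id."""
    schedule = [f"bottom_{i}" for i in range(N_UNIQUE)]
    for _ in range(N_REPEATS):
        schedule += [f"shared_{i}" for i in range(N_SHARED)]
    return schedule

def layer_type(virtual_idx):
    """3:1 GDN:Attention ratio -- every 4th virtual layer is attention."""
    return "attn" if (virtual_idx + 1) % 4 == 0 else "gdn"

schedule = virtual_layer_schedule()
assert len(schedule) == 16       # 16 virtual layers
assert len(set(schedule)) == 8   # from only 8 physical layers
```

This is why the "virtual equivalent" parameter count is roughly double the physical 510M: the shared top block contributes its weights to 12 of the 16 virtual layers while being stored once.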

Training Curves

Training

  • Step: 249,000
  • Data: Mixed (75% FineWeb-Edu, 18% StarCoder, 5% FineMath, 2% UltraChat)
  • Optimizer: Muon + AdamW (decoupled embedding LR)
  • Schedule: WSD (Warmup-Stable-Decay)
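A minimal sketch of a Warmup-Stable-Decay learning-rate schedule, assuming linear warmup, a constant plateau, and linear decay; the card does not state Auron's actual phase boundaries or peak LR, so all values here are placeholders.

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps):
    """Warmup-Stable-Decay schedule: ramp up, hold, ramp down."""
    if step < warmup_steps:                    # warmup: linear ramp to peak
        return peak_lr * step / warmup_steps
    if step < warmup_steps + stable_steps:     # stable: hold at peak
        return peak_lr
    done = step - warmup_steps - stable_steps  # decay: linear ramp to zero
    return peak_lr * max(0.0, 1.0 - done / decay_steps)
```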

Usage

# Install
git clone https://github.com/Fy-/Auron && cd Auron && rye sync

# Generate (Python)
from ouro import load_model, generate

model, tokenizer, device = load_model("nyxia/Auron-510M")
generate(model, tokenizer, device, "The history of")

Sampling

Default: T=0.7, top_k=20, top_p=0.9, rep_pen=1.0, presence_pen=1.5 (Ouroboros weight sharing requires presence penalty >= 1.5 to prevent attractor wells).
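An illustrative sketch (not the repo's sampler) of how a presence penalty of this kind is typically applied: every token id that has already appeared gets a flat logit deduction, which is what discourages the shared-weight model from collapsing into repetitive attractor wells.

```python
def apply_presence_penalty(logits, generated_ids, presence_pen=1.5):
    """Return logits with presence_pen subtracted for tokens seen so far."""
    seen = set(generated_ids)  # token ids already in the output
    return [lg - presence_pen if tok in seen else lg
            for tok, lg in enumerate(logits)]

logits = [2.0, 1.0, 0.5, 3.0]          # toy 4-token vocabulary
out = apply_presence_penalty(logits, generated_ids=[0, 3])
# tokens 0 and 3 were already generated: 2.0 -> 0.5 and 3.0 -> 1.5
```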

Links

Built by Florian Gasquez (@nyxia). Part of the Soulkyn project.
