# Auron-510M

Auron is a family of Chimera hybrid GDN-attention language models with Ouroboros weight sharing.

Paper: *Auron: Depth-Efficient Language Models via Hybrid Recurrent-Attention Weight Sharing* · Code: [github.com/Fy-/Auron](https://github.com/Fy-/Auron)
## Architecture

- Type: Chimera (`ChimeraConfig`)
- Dim: 1536
- Layers: 16 virtual (8 unique: 4 bottom + a 4-layer shared top block)
- Params: 510,217,280 (510M)
- Vocab: 151,936 (Qwen3 tokenizer)
- Context: 2048 tokens
- Topology: 4 unique bottom layers + 4×3 shared top layers
- GDN:Attn ratio: 3:1 (every 4th layer is attention)
- Virtual equivalent: ~1,020,434,560 params
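The topology above can be made concrete with a small sketch. This is illustrative only (function and names are not from the Auron codebase): 4 unique bottom layers run once, then a block of 4 shared top layers is reused 3 times, giving 16 virtual layers backed by 8 physical layer stacks, with every 4th virtual layer being attention.

```python
# Illustrative sketch of the Ouroboros-style layer schedule described above.
# Not the actual Auron implementation; names and structure are assumptions.

def virtual_schedule(n_bottom=4, n_shared=4, n_repeats=3, attn_every=4):
    """Return (physical_layer_index, kind) for each virtual layer."""
    # 4 unique bottom layers, then the 4 shared top layers repeated 3x
    physical = list(range(n_bottom)) + list(range(n_bottom, n_bottom + n_shared)) * n_repeats
    schedule = []
    for v, p in enumerate(physical):
        # every 4th virtual layer is attention; the rest are GDN (recurrent)
        kind = "attn" if (v + 1) % attn_every == 0 else "gdn"
        schedule.append((p, kind))
    return schedule
```

Because only 8 physical layer stacks back 16 virtual layers, the virtual parameter count is roughly double the physical 510M, matching the ~1.02B figure above.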
## Training Curves

(training-curve figure omitted)
## Training
- Step: 249,000
- Data: Mixed (75% FineWeb-Edu, 18% StarCoder, 5% FineMath, 2% UltraChat)
- Optimizer: Muon + AdamW (decoupled embedding LR)
- Schedule: WSD (Warmup-Stable-Decay)
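A Warmup-Stable-Decay (WSD) schedule, as named above, warms the learning rate up, holds it constant for most of training, then decays it at the end. The sketch below is a generic WSD curve; the peak LR, warmup/decay fractions, and decay shape are assumptions, not Auron's actual hyperparameters.

```python
# Hedged sketch of a WSD learning-rate schedule (illustrative values).

def wsd_lr(step, total_steps, peak_lr=3e-4, warmup_frac=0.01, decay_frac=0.1):
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    stable_end = total_steps - decay_steps
    if step < warmup_steps:
        # linear warmup from 0 to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    if step < stable_end:
        # stable plateau at peak_lr for the bulk of training
        return peak_lr
    # linear decay to 0 over the final decay_frac of training
    return peak_lr * (total_steps - step) / max(1, decay_steps)
```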
## Usage

```shell
git clone https://github.com/Fy-/Auron && cd Auron && rye sync
```

```python
from ouro import load_model, generate

model, tokenizer, device = load_model("nyxia/Auron-510M")
generate(model, tokenizer, device, "The history of")
```
## Sampling

Defaults: temperature 0.7, top_k 20, top_p 0.9, repetition penalty 1.0, presence penalty 1.5. Ouroboros weight sharing requires a presence penalty of at least 1.5 to prevent the model from falling into attractor wells (repetitive loops).
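The sampling pipeline above can be sketched as follows. This is a minimal, framework-free illustration of temperature, top-k, top-p, and a flat presence penalty applied to already-generated tokens; function and variable names are illustrative, not from the Auron codebase.

```python
# Sketch of temperature / top-k / top-p sampling with a presence penalty.
import math

def sample_logits(logits, generated, temperature=0.7, top_k=20, top_p=0.9,
                  presence_penalty=1.5):
    """Return a renormalized token distribution after all filters."""
    # presence penalty: flat logit offset for tokens already in the output
    logits = [l - presence_penalty if i in generated else l
              for i, l in enumerate(logits)]
    logits = [l / temperature for l in logits]
    # top-k: keep only the k highest-logit tokens
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # softmax over the kept tokens (max-subtracted for stability)
    m = max(logits[i] for i in kept)
    exps = {i: math.exp(logits[i] - m) for i in kept}
    z = sum(exps.values())
    probs = {i: e / z for i, e in exps.items()}
    # top-p: smallest prefix of tokens whose cumulative mass reaches top_p
    cum, nucleus = 0.0, {}
    for i in sorted(probs, key=probs.get, reverse=True):
        nucleus[i] = probs[i]
        cum += probs[i]
        if cum >= top_p:
            break
    z = sum(nucleus.values())
    return {i: p / z for i, p in nucleus.items()}
```

The presence penalty differs from a repetition penalty: it subtracts a fixed offset from any token that has appeared at least once, rather than scaling logits multiplicatively, which is what suppresses the attractor wells mentioned above.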
## Links
- Paper: Auron: Depth-Efficient Language Models via Hybrid Recurrent-Attention Weight Sharing
- Code: github.com/Fy-/Auron
- Models: huggingface.co/nyxia
Built by Florian Gasquez (@nyxia). Part of the Soulkyn project.