NeoLLM

NeoLLM is a 116M-parameter decoder-only language model trained from scratch on FineWeb-Edu in FP8 precision, completing training in approximately 6 hours on a single NVIDIA RTX 5090. It integrates a collection of recently published attention and normalization techniques into a single architecture, with the goal of studying how they interact during pretraining. The model is under active development, and the current checkpoint represents an intermediate training state.

Author / contact: @Kyokopom on X
Repository: KitsuVp/NeoLLM


Architecture

NeoLLM is a decoder-only transformer with the following configuration:

| Parameter | Value |
|---|---|
| Hidden size | 512 |
| Layers | 12 |
| Attention heads | 8 |
| KV heads (GQA) | 4 |
| Head dim | 64 |
| Intermediate size | 1536 |
| Vocabulary | Qwen3 tokenizer (64,402 tokens) |
| Context length | 512 tokens |
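
The configuration above can be written down as a small Python dataclass to make the derived shapes explicit. This is only a sketch: the class name `NeoLLMConfigSketch` and its field names are hypothetical and are not taken from the repository, and several of the integrated techniques (for example FAN or IHA) may change the actual projection shapes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NeoLLMConfigSketch:
    """Hypothetical container for the values in the configuration table."""
    hidden_size: int = 512
    num_layers: int = 12
    num_attention_heads: int = 8
    num_kv_heads: int = 4
    head_dim: int = 64
    intermediate_size: int = 1536
    vocab_size: int = 64_402
    context_length: int = 512

    def __post_init__(self) -> None:
        # GQA requires the query heads to split evenly across the KV heads.
        if self.num_attention_heads % self.num_kv_heads != 0:
            raise ValueError("num_attention_heads must be a multiple of num_kv_heads")

    @property
    def queries_per_kv_head(self) -> int:
        """Number of query heads that share one key/value head under GQA."""
        return self.num_attention_heads // self.num_kv_heads

    @property
    def q_width(self) -> int:
        """Total width of all query heads."""
        return self.num_attention_heads * self.head_dim

    @property
    def kv_width(self) -> int:
        """Total width of all key (or value) heads."""
        return self.num_kv_heads * self.head_dim


cfg = NeoLLMConfigSketch()
print(cfg.queries_per_kv_head, cfg.q_width, cfg.kv_width)
```

With these values, two query heads share each KV head, the query width equals the hidden size (8 × 64 = 512), and the key/value width is half of that (4 × 64 = 256).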

Parameter breakdown

| Parameter bucket | Count |
|---|---|
| Total parameters | 116.22M (116,216,184) |
| Embedding parameters (tied) | 32.97M (32,973,824) |
| Non-embedding parameters | 83.24M (83,242,360) |
| Effective trainable parameters | 116.22M (116,216,184) |

Weight tying is enabled: the input embedding matrix and the language-model head share the same parameters, so the embedding matrix is counted once in the 116.22M total. The non-embedding parameter count is total − embedding = 83.24M.
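
The counts in the breakdown can be checked with a few lines of arithmetic from the vocabulary size, the hidden size, and the reported total. The last figure is purely illustrative: it assumes a hypothetical untied, bias-free output head, which is not how the model is configured.

```python
VOCAB_SIZE = 64_402          # Qwen3 tokenizer size from the configuration table
HIDDEN_SIZE = 512
TOTAL_PARAMS = 116_216_184   # reported total, with the embedding counted once

# One (vocab x hidden) matrix serves as both input embedding and LM head.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE
non_embedding_params = TOTAL_PARAMS - embedding_params

# Hypothetical: size of the model if the LM head were a separate matrix.
untied_total = TOTAL_PARAMS + embedding_params

print(f"{embedding_params:,}")      # 32,973,824
print(f"{non_embedding_params:,}")  # 83,242,360
print(f"{untied_total:,}")          # 149,190,008
```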

Integrated techniques

NeoLLM combines architecture modules, optional auxiliary objectives, and training-time optimizer/stability components from the following papers.

Embedding and token representation

  • Learnable Multipliers (arXiv:2601.04890) — Adds per-row and per-column learnable scalar parameters to selected matrix layers and, when enabled, embeddings.
  • Leviathan (arXiv:2601.22040) — Optional continuous token embedding generator that can replace the discrete input lookup table.
  • KHRONOS (arXiv:2505.13315) — Kernel/basis reference used by the Leviathan continuous token generator implementation.
  • JTok / JTok-M (arXiv:2602.00800) — Optional token-indexed self-modulation surfaces over Leviathan coordinates.
  • Spelling Bee Embeddings (arXiv:2601.18030) — Augments token embeddings with character-level spelling information.
  • Token Embedding Manifold analysis (arXiv:2504.01002) — Reference motivation for treating token embeddings as structured objects rather than unconstrained lookup rows.
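
As a rough illustration of the Learnable Multipliers idea described above, per-row and per-column scalars can be applied to a weight matrix by broadcasting. This is a minimal NumPy sketch, not the paper's or the repository's parameterization: the function name `apply_multipliers` and the initialization at 1 are assumptions made for the example.

```python
import numpy as np


def apply_multipliers(weight, row_scale, col_scale):
    """Return diag(row_scale) @ weight @ diag(col_scale) using broadcasting."""
    return row_scale[:, None] * weight * col_scale[None, :]


rng = np.random.default_rng(0)
weight = rng.normal(size=(4, 3))

# Assumption: multipliers start at 1, so the layer is initially unchanged.
row_scale = np.ones(4)
col_scale = np.ones(3)
assert np.allclose(apply_multipliers(weight, row_scale, col_scale), weight)

# After training, each row and each column carries its own learned scale.
row_scale = np.array([0.5, 1.0, 2.0, 1.5])
col_scale = np.array([1.0, 0.1, 3.0])
scaled = apply_multipliers(weight, row_scale, col_scale)
```

The scalars add only rows + columns extra parameters per matrix (7 here), which is negligible next to the matrix itself.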

Attention, positions, and output projection

  • FAN (arXiv:2502.21309) — Fourier Analysis Networks. A portion of the projection channels are dedicated to periodic cosine/sine features.
  • MEA (arXiv:2601.19611) — Explicit Multi-head Attention. Adds small learnable interaction matrices between attention heads for K and V.
  • LUCID (arXiv:2602.10410) — Applies a learned lower-triangular preconditioner to V before attention, decorrelating value representations across positions.
  • Affine-Scaled Attention (arXiv:2602.23057) — Adds two learnable per-head scalars (α and β) to the softmax weights: [α·softmax(QKᵀ) + β]·V.
  • XSA (arXiv:2603.09078) — Exclusive Self Attention. After computing attention, removes the component of the output aligned with the token's own value vector.
  • Directional Routing (arXiv:2603.14923) — Each head learns K=4 directions in the output space; a learned router suppresses the attention output along each direction per input.
  • Gated Attention (arXiv:2505.06708) — A sigmoid gate is applied to the attention output before the output projection, introducing non-linearity and preventing attention sinks.
  • Momentum Attention (arXiv:2411.03884) — Modifies Q and K by subtracting a fraction of the previous position's Q and K values (causal first-difference).
  • Interleaved Head Attention / IHA (arXiv:2602.21371) — Builds pseudo-heads from learned cross-head mixtures to create multiple attention patterns per original head.
  • REPO (arXiv:2512.14391) — Context re-positioning module that learns contextual position coordinates above a configurable start layer.
  • GRAPE (arXiv:2512.07805) — Group representational position encoding used by the REPO-GRAPE positional path.
  • GOAT priors (arXiv:2601.15380) — Optional factorized attention log-prior channels inspired by trainable attention priors.
  • Hadamard output projection (arXiv:2603.08343) — Replaces dense attention output projection with a structured Hadamard transform plus lightweight scaling.
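
Three of the operations listed above (Momentum Attention, Affine-Scaled Attention, and XSA) are simple enough to sketch for a single head in NumPy. This is an illustration of the one-line descriptions, not the repository's implementation: the function names, the 1/√d score scaling, the causal masking of β, and the `eps` term are assumptions added for the example.

```python
import numpy as np


def causal_softmax(scores):
    """Row-wise softmax over a (T, T) score matrix with future positions masked."""
    T = scores.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def momentum_shift(x, gamma):
    """Subtract gamma times the previous position's vector (causal first-difference)."""
    prev = np.zeros_like(x)
    prev[1:] = x[:-1]  # position 0 has no predecessor
    return x - gamma * prev


def affine_scaled_attention(q, k, v, alpha, beta):
    """Single-head [alpha * softmax(QK^T / sqrt(d)) + beta] V."""
    d = q.shape[-1]
    weights = causal_softmax(q @ k.T / np.sqrt(d))
    visible = np.tril(np.ones_like(weights))
    # Assumption: beta is added only at visible positions so no future token leaks in.
    return ((alpha * weights + beta) * visible) @ v


def exclusive_output(out, v, eps=1e-6):
    """Remove from each output row its projection onto that token's own value vector."""
    coef = (out * v).sum(-1, keepdims=True) / ((v * v).sum(-1, keepdims=True) + eps)
    return out - coef * v


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(5, 8)) for _ in range(3))

q_m, k_m = momentum_shift(q, 0.1), momentum_shift(k, 0.1)
out = affine_scaled_attention(q_m, k_m, v, alpha=1.0, beta=0.0)
out_xsa = exclusive_output(out, v)
```

With `alpha=1.0` and `beta=0.0` the affine form reduces to ordinary causal softmax attention, which makes it a convenient starting point for checking the other two operations.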

Normalization, residual flow, and MLP

  • SeeDNorm (arXiv:2510.22777) — Applied to Q and K projections. Dynamically rescales normalization from the input's own statistics.
  • LayerNorm Scaling / LNS (arXiv:2502.05795) — Each layer's output is scaled by 1/√ℓ where ℓ is the layer index.
  • GPAS (arXiv:2506.22049) — Gradient-Preserving Activation Scaling for residual junctions.
  • PolyNorm (arXiv:2602.04902) — Replaces the standard MLP activation with normalized linear, quadratic, and cubic branches.
  • SimpleGPT (arXiv:2602.01212) — Second-order geometry-inspired normalization strategy applied inside MLP projections.
  • StackMemory / STACKTRANS (NeurIPS 2025) — Optional differentiable hidden-state stack between decoder layers.
  • Attention Residuals / AttnRes (arXiv:2603.15031) — Optional learned depth-wise aggregation over previous layer outputs or block summaries.
  • LAUREL (arXiv:2411.07501) — Optional learned augmented residual layer with residual-weight and low-rank variants.
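
The LayerNorm Scaling factor 1/√ℓ described above can be tabulated for the 12 layers of this model. The sketch assumes ℓ is a 1-based layer index (ℓ = 0 would divide by zero); the function name `lns_scale` is hypothetical.

```python
import math

NUM_LAYERS = 12  # from the configuration table


def lns_scale(layer_index: int) -> float:
    """Return the LNS factor 1/sqrt(l) for a 1-based layer index."""
    if layer_index < 1:
        raise ValueError("layer_index is assumed to start at 1")
    return 1.0 / math.sqrt(layer_index)


scales = [lns_scale(layer) for layer in range(1, NUM_LAYERS + 1)]
for layer, scale in enumerate(scales, start=1):
    print(f"layer {layer:2d}: {scale:.4f}")
```

Under this assumption the first layer is left unscaled, the fourth is scaled by 0.5, and the last by roughly 0.29, so deeper layers contribute progressively smaller outputs.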

Training objectives and training-time regularizers

  • TWEO (arXiv:2511.23225) — Optional Transformers Without Extreme Outliers activation regularizer for FP8/low-bit-friendly training.
  • NITP (arXiv:2605.24956) — Optional Next Implicit Token Prediction auxiliary objective using shallow-layer implicit token targets and a cosine loss.
  • NextLat (arXiv:2511.05963) — Optional next-latent prediction objective using latent dynamics, Smooth L1 supervision, and frozen-head KL.

Optimizer and training stability

  • Conda (arXiv:2509.24218) — Column-Normalized Adam optimizer path used by the training script.
  • Cautious Weight Decay (arXiv:2510.12402) — Sign-selective weight decay variant used by the custom optimizer logic.
  • Correction of Decoupled Weight Decay (arXiv:2512.08217) — Adapts decoupled weight decay during learning-rate decay.
  • AdamHD (arXiv:2511.14721) — Decoupled Huber decay regularization reference used by the optimizer.
  • GradientStabilizer (arXiv:2502.17055) — Optional threshold-free gradient magnitude stabilizer.

Training

| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu (sample-10BT) |
| Tokens seen | ~0.03B (782 steps × batch 64 × length 512) |
| Precision | FP8 native (E4M3 weights/activations, E5M2 gradients) + BF16 fallback |
| Optimizer | Conda (Column-Normalized Adam) |
| Learning rate | 6e-04 with linear warmup (10 % of steps) |
| Weight decay | 0.1 |
| Training time | ~0h 29m |
| Hardware | NVIDIA RTX 5090 (single GPU) |
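
The token count and the warmup length follow directly from the settings above. In the sketch below, rounding the warmup down to whole steps and holding the learning rate at its peak after warmup are assumptions, since the table does not state the rounding rule or the post-warmup schedule; `lr_at` is a hypothetical helper.

```python
STEPS = 782
BATCH_SIZE = 64
SEQ_LEN = 512
PEAK_LR = 6e-4
WARMUP_FRACTION = 0.10

tokens_seen = STEPS * BATCH_SIZE * SEQ_LEN
warmup_steps = int(STEPS * WARMUP_FRACTION)  # assumption: round down


def lr_at(step: int) -> float:
    """Learning rate at a 0-based step: linear warmup, then constant (assumed)."""
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    return PEAK_LR


print(f"{tokens_seen:,}")  # 25,624,576, i.e. about 0.03B tokens
print(warmup_steps)        # 78
```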

Training curve

Step Train Loss Val Loss
782 5.565

Limitations

  • Token budget — ~1.5 B tokens seen, which is below the estimated optimum for a model of this size. Performance on knowledge-intensive tasks is expected to improve with more training.
  • Gradient spike at step 40k — The spike reorganized the attention pattern in layer 9, which previously captured long-range token correlations. A checkpoint from ~step 38k is expected to score better on aggregate benchmarks.
  • PolyNorm exclusivity — The quadratic branch has become partially redundant with the linear branch. This will be corrected in the next training run.
  • Base model only — Not instruction-tuned or aligned; purely a next-token-prediction base model.

References

All papers whose techniques are integrated into NeoLLM's architecture, training objective, or training stack:

| Area | Technique | Paper title | Reference |
|---|---|---|---|
| Embeddings | Learnable Multipliers | Freeing the Scale of Language Model Matrix Layers | arXiv:2601.04890 |
| Embeddings | Leviathan | A Separable Architecture for Continuous Token Representation in Language Models | arXiv:2601.22040 |
| Embeddings | KHRONOS | KHRONOS: a Kernel-Based Neural Architecture for Rapid, Resource-Efficient Scientific Computation | arXiv:2505.13315 |
| Embeddings | JTok / JTok-M | JTok: On Token Embedding as Another Axis of Scaling Law via Joint Token Self-Modulation | arXiv:2602.00800 |
| Embeddings | Spelling Bee | Spelling Bee Embeddings for Language Modeling | arXiv:2601.18030 |
| Embeddings | Token embedding analysis | Token Embeddings Violate the Manifold Hypothesis | arXiv:2504.01002 |
| Attention / positions | FAN | Fourier Analysis Networks | arXiv:2502.21309 |
| Attention / positions | MEA | Explicit Multi-head Attention for Inter-head Interaction in Large Language Models | arXiv:2601.19611 |
| Attention / positions | LUCID | Attention with Preconditioned Representations | arXiv:2602.10410 |
| Attention / positions | Affine-Scaled Attention | Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention | arXiv:2602.23057 |
| Attention / positions | XSA | Exclusive Self Attention | arXiv:2603.09078 |
| Attention / positions | Directional Routing | Directional Routing in Transformers | arXiv:2603.14923 |
| Attention / positions | Gated Attention | Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free | arXiv:2505.06708 |
| Attention / positions | Momentum Attention | Momentum Attention | arXiv:2411.03884 |
| Attention / positions | IHA | Interleaved Head Attention | arXiv:2602.21371 |
| Attention / positions | REPO | Language Models with Context Re-Positioning | arXiv:2512.14391 |
| Attention / positions | GRAPE | Group Representational Position Encoding | arXiv:2512.07805 |
| Attention / positions | GOAT priors | You Need Better Attention Priors | arXiv:2601.15380 |
| Attention / positions | Hadamard o_proj | Rethinking Attention Output Projection: Structured Hadamard Transforms for Efficient Transformers | arXiv:2603.08343 |
| Residual / normalization | SeeDNorm | Self-Rescaled Dynamic Normalization | arXiv:2510.22777 |
| Residual / normalization | LNS | The Curse of Depth in LLMs | arXiv:2502.05795 |
| Residual / normalization | GPAS | Gradient-Preserving Activation Scaling | arXiv:2506.22049 |
| Residual / normalization | PolyNorm | PolyNorm / PolyCom | arXiv:2602.04902 |
| Residual / normalization | SimpleGPT | SimpleGPT | arXiv:2602.01212 |
| Residual / normalization | StackMemory / STACKTRANS | Recursive Transformer: Boosting Reasoning Ability with State Stack | NeurIPS 2025 |
| Residual / normalization | Attention Residuals | Attention Residuals | arXiv:2603.15031 |
| Residual / normalization | LAUREL | LAUREL: Learned Augmented Residual Layer | arXiv:2411.07501 |
| Objectives | TWEO | Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies | arXiv:2511.23225 |
| Objectives | NITP | Next Implicit Token Prediction for LLM Pre-training | arXiv:2605.24956 |
| Objectives | NextLat | Next-Latent Prediction Transformers Learn Compact World Models | arXiv:2511.05963 |
| Optimizer / training | Conda | Column-Normalized Adam for Training Large Language Models Faster | arXiv:2509.24218 |
| Optimizer / training | CWD | Cautious Weight Decay | arXiv:2510.12402 |
| Optimizer / training | WD correction | Correction of Decoupled Weight Decay | arXiv:2512.08217 |
| Optimizer / training | AdamHD | AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training | arXiv:2511.14721 |
| Optimizer / training | GradientStabilizer | GradientStabilizer | arXiv:2502.17055 |

Citation

@misc{neollm2026,
  title  = {NeoLLM: A Research Language Model Integrating Recent Attention and Normalization Techniques},
  author = {KitsuVp},
  year   = {2026},
  url    = {https://huggingface.co/KitsuVp/NeoLLM}
}

Author

@Kyokopom on X


License

Apache 2.0
