NeoLLM

NeoLLM is a 116 M-parameter decoder-only language model trained from scratch on FineWeb-Edu in FP8 precision, completing training in approximately 6 hours on a single NVIDIA RTX 5090. It integrates a collection of recently published attention and normalization techniques into a single architecture, with the goal of studying how they interact during pretraining. The model is under active development, and the current checkpoint represents an intermediate training state.

Author / contact: @Kyokopom on X
Repository: KitsuVp/NeoLLM

Architecture

NeoLLM is a decoder-only transformer with the following configuration:

Parameter	Value
Hidden size	512
Layers	12
Attention heads	8
KV heads (GQA)	4
Head dim	64
Intermediate size	1536
Vocabulary	Qwen3 tokenizer (64,402 tokens)
Context length	512 tokens
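
The configuration table can be restated as a small config object together with the consistency checks it implies. This is a minimal sketch: the class name `NeoLLMConfigSketch` and its field names are illustrative assumptions and are not taken from the NeoLLM repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NeoLLMConfigSketch:
    """Hypothetical container mirroring the configuration table."""
    hidden_size: int = 512
    num_layers: int = 12
    num_attention_heads: int = 8
    num_kv_heads: int = 4
    head_dim: int = 64
    intermediate_size: int = 1536
    vocab_size: int = 64_402
    context_length: int = 512

    def queries_per_kv_head(self) -> int:
        # GQA: each KV head is shared by a group of query heads.
        assert self.num_attention_heads % self.num_kv_heads == 0
        return self.num_attention_heads // self.num_kv_heads


cfg = NeoLLMConfigSketch()
# The query heads tile the hidden dimension exactly: 8 * 64 == 512.
assert cfg.num_attention_heads * cfg.head_dim == cfg.hidden_size
print(cfg.queries_per_kv_head())                 # query heads per KV head
print(cfg.intermediate_size / cfg.hidden_size)   # MLP expansion ratio
```

With 8 query heads and 4 KV heads, each KV head serves two query heads, so the K and V projections are half the width of the Q projection.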

Parameter breakdown

Parameter bucket	Count
Total parameters	116.22M (116,216,184)
Embedding parameters (tied)	32.97M (32,973,824)
Non-embedding parameters	83.24M (83,242,360)
Effective trainable parameters	116.22M (116,216,184)

Weight tying is enabled: the input embedding matrix and the language-model head share the same parameters, so the embedding matrix is counted only once in the 116.22M total. The non-embedding budget is total − embed = 83.24M.
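
The counts in the breakdown can be reproduced from the vocabulary size and hidden size alone. The sketch below only redoes that arithmetic; the variable names are illustrative, and the untied figure assumes a bias-free output head of the same shape as the embedding matrix.

```python
VOCAB_SIZE = 64_402         # from the configuration table
HIDDEN_SIZE = 512           # from the configuration table
TOTAL_PARAMS = 116_216_184  # from the parameter breakdown

# One row of HIDDEN_SIZE values per vocabulary entry.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE
non_embedding_params = TOTAL_PARAMS - embedding_params

# Hypothetical: what the total would be if the LM head were NOT tied
# (assumes a bias-free head with the same shape as the embedding matrix).
untied_total = TOTAL_PARAMS + embedding_params

print(embedding_params)      # 32973824
print(non_embedding_params)  # 83242360
print(untied_total)
```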

Integrated techniques

NeoLLM combines architecture modules, optional auxiliary objectives, and training-time optimizer/stability components from the following papers.

Embedding and token representation

Learnable Multipliers (arXiv:2601.04890) — Adds per-row and per-column learnable scalar parameters to selected matrix layers and, when enabled, embeddings.
Leviathan (arXiv:2601.22040) — Optional continuous token embedding generator that can replace the discrete input lookup table.
KHRONOS (arXiv:2505.13315) — Kernel/basis reference used by the Leviathan continuous token generator implementation.
JTok / JTok-M (arXiv:2602.00800) — Optional token-indexed self-modulation surfaces over Leviathan coordinates.
Spelling Bee Embeddings (arXiv:2601.18030) — Augments token embeddings with character-level spelling information.
Token Embedding Manifold analysis (arXiv:2504.01002) — Reference motivation for treating token embeddings as structured objects rather than unconstrained lookup rows.
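
As an illustration of the Learnable Multipliers entry above, per-row and per-column scalars can be applied to a weight matrix by broadcasting. This is a minimal NumPy sketch under stated assumptions: the function name `apply_multipliers` is hypothetical, the multipliers are initialised to 1 so the layer starts unchanged, and the paper's exact parameterisation may differ.

```python
import numpy as np


def apply_multipliers(weight, row_scale, col_scale):
    """Return diag(row_scale) @ weight @ diag(col_scale) via broadcasting."""
    return row_scale[:, None] * weight * col_scale[None, :]


rng = np.random.default_rng(0)
weight = rng.standard_normal((4, 3))

# Assumption: multipliers start at 1, so the effective weight equals the raw weight.
row_scale = np.ones(4)
col_scale = np.ones(3)
assert np.allclose(apply_multipliers(weight, row_scale, col_scale), weight)

# Doubling one row multiplier rescales only that output row.
row_scale[0] = 2.0
scaled = apply_multipliers(weight, row_scale, col_scale)
print(np.allclose(scaled[0], 2.0 * weight[0]))
```

The extra parameters cost one scalar per row plus one per column, which is negligible next to the matrix itself.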

Attention, positions, and output projection

FAN (arXiv:2502.21309) — Fourier Analysis Networks. A portion of the projection channels are dedicated to periodic cosine/sine features.
MEA (arXiv:2601.19611) — Explicit Multi-head Attention. Adds small learnable interaction matrices between attention heads for K and V.
LUCID (arXiv:2602.10410) — Applies a learned lower-triangular preconditioner to V before attention, decorrelating value representations across positions.
Affine-Scaled Attention (arXiv:2602.23057) — Adds two learnable per-head scalars (α and β) to the softmax weights: [α·softmax(QKᵀ) + β]·V.
XSA (arXiv:2603.09078) — Exclusive Self Attention. After computing attention, removes the component of the output aligned with the token's own value vector.
Directional Routing (arXiv:2603.14923) — Each head learns K=4 directions in the output space; a learned router suppresses the attention output along each direction per input.
Gated Attention (arXiv:2505.06708) — A sigmoid gate is applied to the attention output before the output projection, introducing non-linearity and preventing attention sinks.
Momentum Attention (arXiv:2411.03884) — Modifies Q and K by subtracting a fraction of the previous position's Q and K values (causal first-difference).
Interleaved Head Attention / IHA (arXiv:2602.21371) — Builds pseudo-heads from learned cross-head mixtures to create multiple attention patterns per original head.
REPO (arXiv:2512.14391) — Context re-positioning module that learns contextual position coordinates above a configurable start layer.
GRAPE (arXiv:2512.07805) — Group representational position encoding used by the REPO-GRAPE positional path.
GOAT priors (arXiv:2601.15380) — Optional factorized attention log-prior channels inspired by trainable attention priors.
Hadamard output projection (arXiv:2603.08343) — Replaces dense attention output projection with a structured Hadamard transform plus lightweight scaling.
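
Three of the entries above (Momentum Attention, Affine-Scaled Attention, and XSA) are described precisely enough to sketch for a single causal head. The NumPy code below follows only the one-line descriptions in this list, not the papers or the NeoLLM source. The names `sketch_head`, `alpha`, `beta`, and `gamma` are illustrative, and applying `beta` only to causally visible positions is an assumption.

```python
import numpy as np


def causal_softmax(scores):
    """Row-wise softmax restricted to positions j <= i."""
    t = scores.shape[0]
    mask = np.tril(np.ones((t, t), dtype=bool))
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True), mask


def first_difference(x, gamma):
    """Momentum-style shift: x_t - gamma * x_{t-1}, with x_{-1} taken as 0."""
    prev = np.zeros_like(x)
    prev[1:] = x[:-1]
    return x - gamma * prev


def sketch_head(q, k, v, alpha=1.0, beta=0.0, gamma=0.0):
    """One causal head; q, k, v have shape (seq_len, head_dim)."""
    # Momentum Attention: subtract a fraction of the previous position's Q and K.
    q = first_difference(q, gamma)
    k = first_difference(k, gamma)

    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights, mask = causal_softmax(scores)

    # Affine-Scaled Attention: alpha * softmax + beta (future positions stay 0).
    weights = np.where(mask, alpha * weights + beta, 0.0)
    out = weights @ v

    # XSA: remove the component of each output aligned with the token's own value.
    coef = (out * v).sum(-1, keepdims=True) / ((v * v).sum(-1, keepdims=True) + 1e-8)
    return out - coef * v


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((5, 64)) for _ in range(3))
out = sketch_head(q, k, v, alpha=0.9, beta=0.05, gamma=0.3)
print(out.shape)  # (5, 64)
```

In this sketch the first position can only attend to itself, so its output is a multiple of its own value vector and the XSA step maps it to approximately zero.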

Normalization, residual flow, and MLP

SeeDNorm (arXiv:2510.22777) — Applied to Q and K projections. Dynamically rescales normalization from the input's own statistics.
LayerNorm Scaling / LNS (arXiv:2502.05795) — Each layer's output is scaled by 1/√ℓ where ℓ is the layer index.
GPAS (arXiv:2506.22049) — Gradient-Preserving Activation Scaling for residual junctions.
PolyNorm (arXiv:2602.04902) — Replaces the standard MLP activation with normalized linear, quadratic, and cubic branches.
SimpleGPT (arXiv:2602.01212) — Second-order geometry-inspired normalization strategy applied inside MLP projections.
StackMemory / STACKTRANS (NeurIPS 2025) — Optional differentiable hidden-state stack between decoder layers.
Attention Residuals / AttnRes (arXiv:2603.15031) — Optional learned depth-wise aggregation over previous layer outputs or block summaries.
LAUREL (arXiv:2411.07501) — Optional learned augmented residual layer with residual-weight and low-rank variants.
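
The LNS and PolyNorm entries above can be illustrated with a short sketch. It assumes a 1-based layer index for the 1/√ℓ factor and RMS normalization for each polynomial branch. The function names, the equal branch weights, and the scalar bias are illustrative choices rather than values from NeoLLM.

```python
import math

import numpy as np


def lns_scale(layer_index):
    """LNS factor 1/sqrt(l) for a 1-based layer index l (assumed convention)."""
    return 1.0 / math.sqrt(layer_index)


def rms_normalize(x, eps=1e-6):
    """Divide by the root-mean-square over the last axis."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)


def polynorm(x, weights=(1 / 3, 1 / 3, 1 / 3), bias=0.0):
    """Weighted sum of normalized linear, quadratic, and cubic branches."""
    w1, w2, w3 = weights
    return (w1 * rms_normalize(x)
            + w2 * rms_normalize(x ** 2)
            + w3 * rms_normalize(x ** 3)
            + bias)


# Scale factors for the 12 layers in the configuration table.
print([round(lns_scale(l), 3) for l in range(1, 13)])

x = np.random.default_rng(0).standard_normal((2, 16))
print(polynorm(x).shape)  # (2, 16)
```

Under this convention the first layer is left unscaled and deeper layers are damped progressively, while normalizing each PolyNorm branch separately keeps the higher-order terms from dominating the sum.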

Training objectives and training-time regularizers

TWEO (arXiv:2511.23225) — Optional Transformers Without Extreme Outliers activation regularizer for FP8/low-bit-friendly training.
NITP (arXiv:2605.24956) — Optional Next Implicit Token Prediction auxiliary objective using shallow-layer implicit token targets and a cosine loss.
NextLat (arXiv:2511.05963) — Optional next-latent prediction objective using latent dynamics, Smooth L1 supervision, and frozen-head KL.

Optimizer and training stability

Conda (arXiv:2509.24218) — Column-Normalized Adam optimizer path used by the training script.
Cautious Weight Decay (arXiv:2510.12402) — Sign-selective weight decay variant used by the custom optimizer logic.
Correction of Decoupled Weight Decay (arXiv:2512.08217) — Adapts decoupled weight decay during learning-rate decay.
AdamHD (arXiv:2511.14721) — Decoupled Huber decay regularization reference used by the optimizer.
GradientStabilizer (arXiv:2502.17055) — Optional threshold-free gradient magnitude stabilizer.
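
To make "sign-selective" concrete, the sketch below applies decoupled weight decay only on coordinates where the optimizer update and the parameter share a sign. This reading of Cautious Weight Decay is an assumption, and the rule used by NeoLLM's custom optimizer may differ. The function name and the toy values are illustrative.

```python
import numpy as np


def cautious_decay_step(param, update, lr, weight_decay):
    """One step: param - lr * (update + weight_decay * mask * param).

    mask is 1 where update and param have the same sign (assumed rule),
    so decay is skipped on coordinates the update is pushing away from zero.
    """
    mask = (update * param > 0).astype(param.dtype)
    return param - lr * (update + weight_decay * mask * param)


param = np.array([1.0, -1.0])
update = np.array([0.5, 0.5])  # stands in for the optimizer's update direction
new_param = cautious_decay_step(param, update, lr=0.1, weight_decay=0.1)
print(new_param)
```

In the toy example only the first coordinate is decayed, because its update already moves it toward zero. The second coordinate receives the plain update.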

Training

Setting	Value
Dataset	FineWeb-Edu (sample-10BT)
Tokens seen	~0.03B (782 steps × batch 64 × length 512)
Precision	FP8 native (E4M3 weights/activations, E5M2 gradients) + BF16 fallback
Optimizer	Conda (Column-Normalized Adam)
Learning rate	6e-04 with linear warmup (10 % of steps)
Weight decay	0.1
Training time	~0h 29m
Hardware	NVIDIA RTX 5090 (single GPU)
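
The token count and warmup length follow directly from the table. The sketch below recomputes them and shows one possible linear-warmup schedule. The table does not state what happens after warmup, so holding the rate constant is an assumption, as is the function name `learning_rate`.

```python
STEPS = 782
BATCH_SIZE = 64
SEQ_LEN = 512
PEAK_LR = 6e-4
WARMUP_FRACTION = 0.10

tokens_seen = STEPS * BATCH_SIZE * SEQ_LEN
warmup_steps = int(WARMUP_FRACTION * STEPS)


def learning_rate(step):
    """Linear warmup to PEAK_LR, then constant (post-warmup shape is assumed)."""
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    return PEAK_LR


print(tokens_seen)   # 25624576, i.e. roughly 0.03B tokens
print(warmup_steps)  # 78
print(learning_rate(0), learning_rate(warmup_steps - 1))
```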

Training curve

Step	Train Loss	Val Loss
782	—	5.565
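
For context, a validation loss can be converted to perplexity and compared with the loss of a uniform guess over the vocabulary. The sketch assumes the reported value is a mean per-token cross-entropy in nats, which the table does not state.

```python
import math

VAL_LOSS = 5.565     # from the training curve table
VOCAB_SIZE = 64_402  # from the configuration table

# Perplexity is exp(loss) when the loss is mean cross-entropy in nats (assumed).
perplexity = math.exp(VAL_LOSS)

# Loss of a model that assigns equal probability to every token.
uniform_loss = math.log(VOCAB_SIZE)

print(round(perplexity, 1))
print(round(uniform_loss, 2))
```

Under that assumption the checkpoint sits well below the uniform baseline but is still at an early stage of training.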

Limitations

Token budget — ~0.03 B tokens seen (782 steps × batch 64 × length 512); well below the estimated optimum. Performance on knowledge-intensive tasks is expected to improve with more training.
Gradient spike at step 40k — Reorganized the attention pattern in layer 9 that previously captured long-range token correlations. A checkpoint from ~step 38k is expected to have better aggregate benchmark scores.
PolyNorm exclusivity — The quadratic branch has become partially redundant with the linear branch. Will be corrected in the next training run.
Base model only — Not instruction-tuned or aligned; purely a next-token-prediction base model.

References

All papers whose techniques are integrated into NeoLLM's architecture, training objective, or training stack:

Area	Technique	Paper title	Reference
Embeddings	Learnable Multipliers	Freeing the Scale of Language Model Matrix Layers	arXiv:2601.04890
Embeddings	Leviathan	A Separable Architecture for Continuous Token Representation in Language Models	arXiv:2601.22040
Embeddings	KHRONOS	KHRONOS: a Kernel-Based Neural Architecture for Rapid, Resource-Efficient Scientific Computation	arXiv:2505.13315
Embeddings	JTok / JTok-M	JTok: On Token Embedding as Another Axis of Scaling Law via Joint Token Self-Modulation	arXiv:2602.00800
Embeddings	Spelling Bee	Spelling Bee Embeddings for Language Modeling	arXiv:2601.18030
Embeddings	Token embedding analysis	Token Embeddings Violate the Manifold Hypothesis	arXiv:2504.01002
Attention / positions	FAN	Fourier Analysis Networks	arXiv:2502.21309
Attention / positions	MEA	Explicit Multi-head Attention for Inter-head Interaction in Large Language Models	arXiv:2601.19611
Attention / positions	LUCID	Attention with Preconditioned Representations	arXiv:2602.10410
Attention / positions	Affine-Scaled Attention	Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention	arXiv:2602.23057
Attention / positions	XSA	Exclusive Self Attention	arXiv:2603.09078
Attention / positions	Directional Routing	Directional Routing in Transformers	arXiv:2603.14923
Attention / positions	Gated Attention	Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free	arXiv:2505.06708
Attention / positions	Momentum Attention	Momentum Attention	arXiv:2411.03884
Attention / positions	IHA	Interleaved Head Attention	arXiv:2602.21371
Attention / positions	REPO	Language Models with Context Re-Positioning	arXiv:2512.14391
Attention / positions	GRAPE	Group Representational Position Encoding	arXiv:2512.07805
Attention / positions	GOAT priors	You Need Better Attention Priors	arXiv:2601.15380
Attention / positions	Hadamard o_proj	Rethinking Attention Output Projection: Structured Hadamard Transforms for Efficient Transformers	arXiv:2603.08343
Residual / normalization	SeeDNorm	Self-Rescaled Dynamic Normalization	arXiv:2510.22777
Residual / normalization	LNS	The Curse of Depth in LLMs	arXiv:2502.05795
Residual / normalization	GPAS	Gradient-Preserving Activation Scaling	arXiv:2506.22049
Residual / normalization	PolyNorm	PolyNorm / PolyCom	arXiv:2602.04902
Residual / normalization	SimpleGPT	SimpleGPT	arXiv:2602.01212
Residual / normalization	StackMemory / STACKTRANS	Recursive Transformer: Boosting Reasoning Ability with State Stack	NeurIPS 2025
Residual / normalization	Attention Residuals	Attention Residuals	arXiv:2603.15031
Residual / normalization	LAUREL	LAUREL: Learned Augmented Residual Layer	arXiv:2411.07501
Objectives	TWEO	Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies	arXiv:2511.23225
Objectives	NITP	Next Implicit Token Prediction for LLM Pre-training	arXiv:2605.24956
Objectives	NextLat	Next-Latent Prediction Transformers Learn Compact World Models	arXiv:2511.05963
Optimizer / training	Conda	Column-Normalized Adam for Training Large Language Models Faster	arXiv:2509.24218
Optimizer / training	CWD	Cautious Weight Decay	arXiv:2510.12402
Optimizer / training	WD correction	Correction of Decoupled Weight Decay	arXiv:2512.08217
Optimizer / training	AdamHD	AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training	arXiv:2511.14721
Optimizer / training	GradientStabilizer	GradientStabilizer	arXiv:2502.17055

Citation

@misc{neollm2026,
  title  = {NeoLLM: A Research Language Model Integrating Recent Attention and Normalization Techniques},
  author = {KitsuVp},
  year   = {2026},
  url    = {https://huggingface.co/KitsuVp/NeoLLM}
}

Author

@Kyokopom on X

License

Apache 2.0
