Deeplm-105M v2 (Step 19,500)

An Indonesian language model with a novel architecture combining MLA, MoE, Hyper-Connections, Hybrid Attention, Multi-Token Prediction, Self-Evolution, and an autonomous AutoTuner.

Trained on A10G (24GB) for 19,500 steps (~24h) with progressive curriculum, dynamic category sampling, activated reflection+memory+routing algorithms, and energy-based hyperparameter control.

Training Progress (Step 8,000 → 19,500)

Metric	Step 8,000	Step 18,000	Step 19,500	Delta (8k→19.5k)
Loss (range)	59.6 ± 31.2	29.3 – 83.2	31.29 (at step)	—
Mean Loss (8k+)	—	53.5	50.68	-2.8 (vs. 18k)
Best Loss (eval)	—	56.07	56.07	—
Curriculum	balanced	medium	hard	↑
Learning Rate	3.84e-04	9.80e-05	6.47e-05	6.0× ↓
Gradient Norm (avg)	21.3	16.8	15.4	-5.9
Throughput	262 tok/s	262 tok/s	249 tok/s	-5%
Total Tokens	—	~263K	~381K	+45% (vs. 18k)

4-panel training curves: loss (with 20-step moving average), learning rate (cosine, log scale), gradient norm (MA-30, log scale), and throughput tokens/s (MA-50). Green = previous upload at step 18,000, Red = current upload at step 19,500. Data logged every 10 steps from step 8,010 to 19,690.

Key Observations (8k → 19.5k)

Curriculum progression: balanced → medium → hard — loss variance reflects tier transitions
Loss range: 28.84 – 85.48 (mean 50.68) — diverse curriculum tiers (easy → hard reasoning)
Best eval loss: 56.07 held steady (no new eval between steps 9,500 and 19,500)
LR decay: Cosine schedule from 2.95e-04 → 6.29e-05 at step 19,690
AutoTuner: Phase changed from balanced → exploitation; actively reducing LR and increasing weight decay for regularization
Reflection + Memory + Routing: Activated after step ~15,000; adds overhead (~249 tok/s vs 262)
Gradient norm: Stable at 5–20 range with fewer extreme spikes as training progresses

Architecture

Component	Detail
Total Parameters	104,747,048 (~105M)
Vocabulary	32,000 (BBPE)
Layers	10 Transformer blocks
Hidden Size	512
Feed-Forward	2048 (SwiGLU, 4× hidden)
Attention Heads	8 query heads, 1 KV head (MQA)
Head Dim	128 (64 RoPE + 64 NoPE)
Max Seq Length	4096
RoPE Theta	50,000
Attention	MLA (Multi-head Latent Attention)
FFN	MoE (4 routed + 1 shared experts, top-k=2)
Residual	Hyper-Connections with Sinkhorn routing
Hybrid Attention	3 softmax + 7 Lightning layers
Prediction	MTP (Multi-Token Prediction, depth=2, 2 MTP layers)
Self-Evolution	Autonomous research loop (100+ rounds)
Embeddings	Tied (shared between input/output)
AutoTuner	Adaptive energy-based optimizer scheduler
Dtype	float32 (Hyper-Connections stability)

Key Innovations

1. Multi-head Latent Attention (MLA) — DeepSeek V4 / Kimi K2.6

Q compressed: hidden → q_lora_rank(192) → LayerNorm → q_up(8 × 128)
KV compressed: hidden → [kv_latent(64) + k_rope(64)] → kv_up → [k_nope(64) + v(128)] × 8 heads
Entire KV cache per token: just 128 dims (64 latent + 64 rope) — ~8× smaller than standard MHA
Decoupled RoPE applied only to 64-dim k_pe, content path stays RoPE-free
Absorption trick folds W_UK into the query projection and W_UV into the output projection for faster inference
MQA-style: KV decomposed once, expanded to all query heads
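
The cache-size figure above can be checked with a little arithmetic. This sketch assumes the "standard MHA" baseline caches keys and values at the full hidden width (2 × 512 values per token per layer); the card does not state which baseline it uses, so treat that as an assumption.

```python
# Per-token, per-layer KV cache size, counted in cached values (not bytes).
hidden_size = 512
kv_latent_dim = 64   # compressed KV latent
k_rope_dim = 64      # decoupled RoPE key, shared across heads

# MLA caches only the latent and the RoPE key.
mla_cache = kv_latent_dim + k_rope_dim

# Assumed baseline: standard MHA caching K and V at full hidden width.
mha_cache = 2 * hidden_size

ratio = mha_cache / mla_cache
print(mla_cache, mha_cache, ratio)
```

Under a different baseline (for example, 8 uncompressed heads at this model's 128-dim head size for both K and V) the ratio would come out larger, so the ~8× figure depends on what is being compared.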

2. Mixture of Experts (MoE) — DeepSeek V4 / Kimi K2.6

4 routed experts + 1 shared expert (always active, Kimi K2.6 style)
Top-k=2 routing: each token activates only 2 of the 4 routed experts (plus the always-active shared expert)
sqrt(softplus(x)) scoring for numerical stability (DeepSeek V4)
Bias-based load balancing (no auxiliary loss, no gradient interference)
Per-expert routing bias auto-updates to balance token assignments
SwiGLU activation in every expert (fused gate+up projection)
Expert affinity memory tracks token-expert history
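
The routing bullets above can be illustrated with a small pure-Python sketch. It is not the model's code: the function names, the rule that the bias affects selection but not the mixing weights, and the fixed bias step size are all assumptions made for illustration.

```python
import math

def softplus(x: float) -> float:
    # Numerically stable softplus: log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def route(logits, bias, top_k=2):
    """Select top_k experts by biased score; mix them by unbiased score."""
    scores = [math.sqrt(softplus(z)) for z in logits]
    # Assumption: the bias only influences which experts are selected.
    ranked = sorted(range(len(scores)),
                    key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:top_k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}

def update_bias(bias, counts, step=0.01):
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    mean = sum(counts) / len(counts)
    return [b - step if c > mean else b + step if c < mean else b
            for b, c in zip(bias, counts)]

# Hypothetical router logits for one token over 4 routed experts.
weights = route([2.0, 0.5, -1.0, 1.0], bias=[0.0, 0.0, 0.0, 0.0])
new_bias = update_bias([0.0, 0.0, 0.0, 0.0], counts=[10, 2, 4, 4])
print(weights, new_bias)
```

Because the balancing signal is applied to the bias rather than to the loss, it adds no gradient term, which is the property the "no auxiliary loss" bullet refers to.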

3. Hyper-Connections with Sinkhorn Routing — DeepSeek V4

Replaces standard residual connections with learned routing
4 connection types: identity, transform, gate, skip
Sinkhorn-Knopp normalization (2 iterations) for doubly-stochastic weights
Input-dependent routing via gating network
Type-specific learnable biases initialized per config
Pre-LayerNorm on layer output before routing
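
As a rough illustration of the Sinkhorn-Knopp step, the sketch below alternates row and column normalization for 2 iterations on a hypothetical 4×4 logit matrix. The matrix shape and values are assumptions; the card does not say what the routing weights are indexed by.

```python
import math

def sinkhorn(logits, iterations=2):
    """Alternate row and column normalization of exp(logits) (square matrix)."""
    m = [[math.exp(v) for v in row] for row in logits]
    n = len(m)
    for _ in range(iterations):
        # Row normalization: each row sums to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Column normalization: each column sums to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m

# Hypothetical routing logits.
logits = [
    [0.2, 0.0, -0.1, 0.1],
    [0.0, 0.3, 0.0, -0.2],
    [-0.1, 0.0, 0.2, 0.0],
    [0.1, -0.2, 0.0, 0.3],
]
weights = sinkhorn(logits, iterations=2)
row_sums = [sum(row) for row in weights]
col_sums = [sum(weights[i][j] for i in range(4)) for j in range(4)]
print(row_sums, col_sums)
```

With only 2 iterations the result is approximately, not exactly, doubly stochastic: the columns sum to 1 after the final step, while the rows are merely close to 1.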

4. Hybrid Attention — MiniMax M2.7

3 softmax layers (indices 0, 4, 8): Standard MLA with full causal attention
7 linear layers (1, 2, 3, 5, 6, 7, 9): MLA + LightningAttentionV2 50/50 blend
LightningAttentionV2: O(n) complexity with intra-block softmax + inter-block KV product
Incremental KV state for efficient autoregressive generation
ReLU/Swish activation replaces softmax in linear path
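
The layer layout above happens to match a simple "every fourth layer is softmax" rule. The sketch below reproduces the listed indices under that assumption and shows a 50/50 blend of two output vectors; the modulo rule and the `blend` helper are illustrative, not taken from the model's code.

```python
NUM_LAYERS = 10

def attention_kind(layer_idx: int) -> str:
    # Assumed rule that reproduces the listed indices: 0, 4, 8 are softmax.
    return "softmax" if layer_idx % 4 == 0 else "lightning"

def blend(mla_out, lightning_out, alpha=0.5):
    """Element-wise mix of the two attention paths (alpha=0.5 is a 50/50 blend)."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(mla_out, lightning_out)]

schedule = [attention_kind(i) for i in range(NUM_LAYERS)]
softmax_layers = [i for i, kind in enumerate(schedule) if kind == "softmax"]
lightning_layers = [i for i, kind in enumerate(schedule) if kind == "lightning"]
mixed = blend([1.0, 2.0], [3.0, 6.0])
print(softmax_layers, lightning_layers, mixed)
```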

5. Multi-Token Prediction (MTP) — DeepSeek V4

2 MTP layers, each predicting 2 tokens ahead (mtp_depth=2)
Projection block: Linear → LayerNorm → GELU → Linear + residual skip
RoPE positional encoding on reduced dim (hidden/4) for efficiency
Tied LM head shares parameters with main embedding layer
Chunked computation (chunk_size=16) to avoid full (B, S, V) logits
Loss weight: 0.3 × cross-entropy of future token predictions
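
A minimal sketch of how the 0.3 loss weight might enter the objective. It assumes the MTP heads' cross-entropies are averaged before weighting; the card only gives the weight, so the averaging and the function names are assumptions.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one logit vector against one target index."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

def total_loss(main_ce, mtp_ces, mtp_weight=0.3):
    """Main next-token loss plus the weighted mean of the MTP losses."""
    return main_ce + mtp_weight * sum(mtp_ces) / len(mtp_ces)

# Hypothetical per-head losses for the 2 MTP layers.
uniform_ce = cross_entropy([0.0, 0.0], target=0)
loss = total_loss(main_ce=2.0, mtp_ces=[3.0, 5.0])
print(uniform_ce, loss)
```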

6. Self-Evolution Framework — MiniMax M2.7 / Deeplm

Autonomous 8-phase research loop: hypothesis → design → execute → analyze → diagnose → fix → evaluate → decide
100+ autonomous optimization rounds per training cycle
3 feedback chain episodes for meta-learning

7. AutoTuner — Deeplm custom

Energy-based adaptive hyperparameter controller
Phase-aware dynamics (warmup → exploration → balanced → exploitation)
Bayesian dynamics model: uncertainty-aware lr/wd sensitivity (Welford variance)
Multi-timescale loss EMAs (short=0.9, med=0.98, long=0.995)
Gradient noise scale monitoring
Cosine similarity for gradient direction tracking
Layer health monitoring with per-group gradient ratios
Failure-aware rollback with revive mechanism
Strategic planner: multi-step scheduled adjustments with plan accuracy tracking
Dual-window trajectory predictor: regime change detection, convergence estimation
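
Two of the building blocks named above, multi-timescale EMAs and Welford's running variance, are easy to sketch. The class below applies both to a loss stream purely as an illustration; in the AutoTuner the Welford variance is described as tracking lr/wd sensitivity, and the class and attribute names here are hypothetical.

```python
class LossTracker:
    """Tracks short/medium/long EMAs and a running variance of the loss."""

    def __init__(self, decays=(0.9, 0.98, 0.995)):
        self.decays = decays
        self.emas = None      # initialized from the first observation
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0         # running sum of squared deviations (Welford)

    def update(self, loss: float) -> None:
        if self.emas is None:
            self.emas = [loss] * len(self.decays)
        else:
            self.emas = [d * e + (1.0 - d) * loss
                         for d, e in zip(self.decays, self.emas)]
        # Welford's single-pass update of mean and squared deviations.
        self.count += 1
        delta = loss - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (loss - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; defined as 0.0 until two values have been seen.
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0

tracker = LossTracker()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    tracker.update(value)
print(tracker.emas, tracker.mean, tracker.variance)
```

When the three EMAs agree, as in the reported state, the loss has been flat across all three timescales; a short EMA pulling away from the long one signals a recent regime change.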

Training Configuration

Config	Value
Dataset	Wikipedia-id (Indonesian) + GLM-5.1 (English reasoning) + English Wikipedia
Tokenizer	32K BBPE
Optimizer	SGD Nesterov (momentum=0.9, weight_decay=0.1)
LR Schedule	Cosine (warmup 3%)
Base LR	3e-4
Effective Batch	36 (12 × 3 grad_accum)
Sequence Length	2048
Max Grad Norm	1.0 (auto-tuned)
Total Steps	19,500
GPU	A10G (24GB)
Dtype	float32
Curriculum	4-tier (easy → medium → hard → reasoning), current: hard
Dynamic Mix	Adaptive per-category sampling weights, applied via WeightedBucketSampler
Tokenization	Disk-cached (SHA-256 keyed), no re-tokenization per epoch
Filtering	StrictFilter: URL/HTML/emoji stripping + char ratio + language score + repetition + min words
Batching	BucketDataset: groups by length for efficient padding
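
For reference, a generic cosine schedule with 3% linear warmup looks like the sketch below. The `total_steps` and `min_lr` arguments are assumptions, since the card does not state the schedule horizon or floor, and the function does not attempt to reproduce the logged learning rates, which also reflect the AutoTuner's LR multiplier.

```python
import math

def cosine_lr(step, total_steps, base_lr=3e-4, warmup_frac=0.03, min_lr=0.0):
    """Linear warmup followed by cosine decay from base_lr to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical 1,000-step run: warmup covers the first 30 steps.
peak = cosine_lr(30, total_steps=1000)
halfway = cosine_lr(515, total_steps=1000)
final = cosine_lr(1000, total_steps=1000)
print(peak, halfway, final)
```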

Training Algorithms

Algorithm	Status	Description
Curriculum Learning	Active	4-tier easy→hard progression by text length
Dynamic Sampling	Active	Adaptive category mix based on per-category loss
Difficulty Scheduling	Active	4 phases: Token Learning → Syntax → Reasoning → Expert
MoE Balancing	Active	Bias-based load-balanced routing
AutoTuner	Active	Energy-based adaptive hyperparameter control
MTP	Active	Auxiliary multi-token prediction loss
Curriculum Scheduling	Active	Loss-based adaptive difficulty
Reflection Training	Active	High-loss example replay (1,500 stored)
Memory Algorithms	Active	1,500 stored, avg loss 10.1
Tool Routing	Active	Code=706, Math=205, Formal=587 routed
Synthetic Evolution	Inactive	Model-generated training data (potential A10G bottleneck)

AutoTuner State (Step 19,500)

Metric	Step 18,000	Step 19,500	Change
Phase	Balanced	Exploitation	↑ aggressiveness
LR Multiplier	0.78×	0.64×	↓ 18%
Grad Norm Multiplier	0.76×	0.64×	↓ 16%
Weight Decay Mult	1.60×	active	regularization
Best (smoothed loss)	28.84	4.01	↓ (different scale)
Best Eval Loss	56.07	56.07	— (no new eval)
Adjustments Made	—	152	learned control
Degeneracy Reductions	—	2	prevented divergence
Cosine Similarity EMA	—	0.15	moderate direction stability
Gradient Noise EMA	—	0.10	low noise
Gradient Norm (avg)	—	3.45	well-controlled
Diagnosis	Overfitting	Exploitation	phase-consistent
Plan Strategy	Regularize	regularization ongoing	—
Plan Accuracy	0.04	—	exploratory phase
Trajectory Slope	+1.85 (r²=0.08)	—	high variance
Mix Weights	short=5.6%, med=40.8%, long=30.4%, vlong=23.2%	short=43.3%, med=24.5%, long=18.3%, vlong=13.9%	shifted to short
Curriculum	medium	hard	↑ difficulty

The AutoTuner entered the exploitation phase at step 19,500, reducing LR to 6.35e-5 (LR multiplier 0.64×), reducing grad clip to 0.64×, and increasing weight decay for regularization. The multi-timescale EMAs (short=10.3, med=10.3, long=10.3) indicate stable convergence at the level of the underlying dynamics, even though curriculum tier transitions cause high variance in the surface loss.

Routing Activity (Step 19,500)

Route	Count	Avg Performance
Code	706	10.36
Math	205	10.29
Formal	587	10.36
Creative	1	10.83
Dialog	1	9.48

Routing algorithms are actively classifying training examples by type, with code and formal reasoning dominating the mix.

Data Pipeline (New in v2)

StrictFilter: Multi-layer text quality filter — URL/HTML/emoji stripping → char ratio ≥0.25 → language score ≥0.001 → 4-gram repetition ≤0.4 → min 10 words
TokenCache: SHA-256 keyed disk cache — tokenize once per unique text, no re-tokenization across epochs
BucketDataset: Groups texts by similar length (bucket_size=64) to minimize padding waste
WeightedBucketSampler: Importance sampling by category weights, synced from DynamicSampler every 500 steps
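
The caching and bucketing ideas can be sketched with the standard library alone. The class below is a hypothetical stand-in, not the project's `TokenCache` or `BucketDataset`: the JSON file layout, the constructor arguments, and the reading of `bucket_size` as "items per bucket" are all assumptions.

```python
import hashlib
import json
import tempfile
from pathlib import Path

class TokenCache:
    """Disk cache keyed by the SHA-256 of the raw text."""

    def __init__(self, cache_dir, tokenize):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.tokenize = tokenize

    def _path(self, text: str) -> Path:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{key}.json"

    def get(self, text: str):
        path = self._path(text)
        if path.exists():
            # Cache hit: reuse stored IDs without calling the tokenizer.
            return json.loads(path.read_text())
        ids = self.tokenize(text)
        path.write_text(json.dumps(ids))
        return ids

def bucket_by_length(token_lists, bucket_size=64):
    """Sort by length, then cut into consecutive buckets of bucket_size items."""
    ordered = sorted(token_lists, key=len)
    return [ordered[i:i + bucket_size] for i in range(0, len(ordered), bucket_size)]

# Toy tokenizer that records how often it is called.
calls = []
def toy_tokenize(text):
    calls.append(text)
    return [ord(c) for c in text]

with tempfile.TemporaryDirectory() as tmp:
    cache = TokenCache(tmp, toy_tokenize)
    first = cache.get("halo dunia")
    second = cache.get("halo dunia")   # served from disk

buckets = bucket_by_length([[0] * n for n in (5, 1, 3, 2)], bucket_size=2)
print(len(calls), [[len(x) for x in b] for b in buckets])
```

Grouping similar lengths together means each batch pads only up to its own longest item rather than the longest item in the dataset.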

Files

File	Description
`model.pt`	Model weights (~105M params, 419MB) — step 19,500
`best.pt`	Best checkpoint by eval loss
`training_state.json`	Full training state including AutoTuner state
`tokenizer.json`	BBPE tokenizer (32K vocab)
`tokenizer_config.json`	Tokenizer configuration
`config.yaml`	Model configuration (DeeplmConfig defaults)
`training_curve_8k_20k.png`	Updated training curves: step 8,010 → 19,690
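
The listed size of `model.pt` is consistent with float32 storage of the stated parameter count. A quick check, assuming 1 MB = 10^6 bytes and negligible serialization overhead:

```python
num_params = 104_747_048
bytes_per_param = 4  # float32

size_bytes = num_params * bytes_per_param
size_mb = round(size_bytes / 1e6)
print(size_bytes, size_mb)
```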

Usage

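```python
import torch
from deeplm.config import DeeplmConfig
from deeplm.model.deeplm import DeeplmModel

# Build the model from the default config, then load the step-19,500 weights.
config = DeeplmConfig()
model = DeeplmModel(config)
# Note: strict=False silently skips missing or unexpected keys.
model.load_state_dict(torch.load("model.pt", map_location="cpu"), strict=False)
model.eval()

# Placeholder token IDs; encode real text with tokenizer.json instead.
input_ids = torch.tensor([[1, 2, 3]])
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.9,
    )
print(output)
```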
