Deeplm-105M v2 (Step 19,500)
Indonesian language model with a novel architecture combining MLA, MoE, Hyper-Connections, Hybrid Attention, Multi-Token Prediction, Self-Evolution, and an autonomous AutoTuner.
Trained on an A10G (24GB) for 19,500 steps (~24h) with a progressive curriculum, dynamic category sampling, the reflection, memory, and routing algorithms enabled, and energy-based hyperparameter control.
Training Progress (Step 8,000 → 19,500)
| Metric | Step 8,000 | Step 18,000 | Step 19,500 | Delta (8k → 19.5k) |
|---|---|---|---|---|
| Loss (range) | 59.6 ± 31.2 | 29.3 – 83.2 | 31.29 (at step) | - |
| Mean Loss (8k+) | - | 53.5 | 50.68 | -2.8 |
| Best Loss (eval) | - | 56.07 | 56.07 | - |
| Curriculum | balanced | medium | hard | - |
| Learning Rate | 3.84e-04 | 9.80e-05 | 6.47e-05 | 6.0× ↓ |
| Gradient Norm (avg) | 21.3 | 16.8 | 15.4 | -5.9 |
| Throughput | 262 tok/s | 262 tok/s | 249 tok/s | -5% |
| Total Tokens | - | ~263K | ~381K | +45% |

4-panel training curves: loss (with 20-step moving average), learning rate (cosine, log scale), gradient norm (MA-30, log scale), and throughput tokens/s (MA-50). Green = previous upload at step 18,000, Red = current upload at step 19,500. Data logged every 10 steps from step 8,010 to 19,690.
Key Observations (8k → 19.5k)
- Curriculum progression: balanced → medium → hard; loss variance reflects tier transitions
- Loss range: 28.84 – 85.48 (mean 50.68), reflecting diverse curriculum tiers (easy → hard reasoning)
- Best eval loss: 56.07 held steady (no new eval between steps 9,500 and 19,500)
- LR decay: Cosine schedule from 2.95e-04 → 6.29e-05 at step 19,690
- AutoTuner: Phase changed from balanced → exploitation; it is actively reducing LR and raising weight decay for regularization
- Reflection + Memory + Routing: Activated after step 15,000; adds overhead (249 tok/s vs 262)
- Gradient norm: Stable in the 5–20 range with fewer extreme spikes as training progresses
Architecture
| Component | Detail |
|---|---|
| Total Parameters | 104,747,048 (~105M) |
| Vocabulary | 32,000 (BBPE) |
| Layers | 10 Transformer blocks |
| Hidden Size | 512 |
| Feed-Forward | 2048 (SwiGLU, 4× hidden) |
| Attention Heads | 8 query heads, 1 KV head (MQA) |
| Head Dim | 128 (64 RoPE + 64 NoPE) |
| Max Seq Length | 4096 |
| RoPE Theta | 50,000 |
| Attention | MLA (Multi-head Latent Attention) |
| FFN | MoE (4 routed + 1 shared experts, top-k=2) |
| Residual | Hyper-Connections with Sinkhorn routing |
| Hybrid Attention | 3 softmax + 7 Lightning layers |
| Prediction | MTP (Multi-Token Prediction, depth=2, 2 MTP layers) |
| Self-Evolution | Autonomous research loop (100+ rounds) |
| Embeddings | Tied (shared between input/output) |
| AutoTuner | Adaptive energy-based optimizer scheduler |
| Dtype | float32 (Hyper-Connections stability) |
Key Innovations
1. Multi-head Latent Attention (MLA) - DeepSeek V4 / Kimi K2.6
- Q compressed: hidden → q_lora_rank(192) → LayerNorm → q_up(8 × 128)
- KV compressed: hidden → [kv_latent(64) + k_rope(64)] → kv_up → [k_nope(64) + v(128)] × 8 heads
- Entire KV cache per token: just 128 dims (64 latent + 64 rope), ~8× smaller than standard MHA
- Decoupled RoPE applied only to 64-dim k_pe, content path stays RoPE-free
- Absorption trick pre-computes W_UK @ W_UV for faster inference
- MQA-style: KV decomposed once, expanded to all query heads
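The cache-size figure above can be checked with a few lines of arithmetic. This is a sketch under one stated assumption: the "standard MHA" baseline is taken to be 8 heads of 64 dims each (hidden size 512 split across heads), caching both K and V. The card does not spell out the baseline, so the ratio depends on that choice.

```python
# Per-token, per-layer KV cache size, counted in stored dimensions.
KV_LATENT = 64    # compressed KV latent, from the MLA list
K_ROPE = 64       # decoupled RoPE key, from the MLA list
HIDDEN = 512      # hidden size, from the Architecture table
N_HEADS = 8       # query heads, from the Architecture table
BYTES_FP32 = 4    # float32, from the Dtype row

mla_dims = KV_LATENT + K_ROPE

# Assumption: baseline MHA uses head_dim = HIDDEN // N_HEADS and caches K and V.
mha_head_dim = HIDDEN // N_HEADS
mha_dims = 2 * N_HEADS * mha_head_dim

ratio = mha_dims / mla_dims
mla_bytes_per_token_per_layer = mla_dims * BYTES_FP32

print(f"MLA: {mla_dims} dims, MHA baseline: {mha_dims} dims, ratio: {ratio:.0f}x")
print(f"MLA cache: {mla_bytes_per_token_per_layer} bytes per token per layer")
```

With a 128-dim-per-head baseline instead, the same arithmetic would give a larger ratio, so the ~8× figure is consistent with the 64-dim-per-head reading.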
2. Mixture of Experts (MoE) - DeepSeek V4 / Kimi K2.6
- 4 routed experts + 1 shared expert (always active, Kimi K2.6 style)
- Top-k=2 routing: each token activates only 2 experts
- sqrt(softplus(x)) scoring for numerical stability (DeepSeek V4)
- Bias-based load balancing (no auxiliary loss, no gradient interference)
- Per-expert routing bias auto-updates to balance token assignments
- SwiGLU activation in every expert (fused gate+up projection)
- Expert affinity memory tracks token-expert history
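The routing rules above can be sketched for a single token in plain Python. The function names, the bias step size, and the choice to apply the bias only when selecting experts (not when computing mixing weights) are assumptions for illustration; the card does not describe these details. The shared expert is not routed and would simply be added to the output.

```python
import math


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def route_token(logits, bias, top_k=2):
    """Return {expert_index: mixing_weight} for one token.

    Scores are sqrt(softplus(logit)). The per-expert bias shifts the
    ranking only (assumption), so it balances load without changing
    the weights of the experts that are finally chosen.
    """
    scores = [math.sqrt(softplus(z)) for z in logits]
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:top_k]
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}


def update_bias(bias, token_counts, step_size=0.01):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean = sum(token_counts) / len(token_counts)
    return [b - step_size * ((c > mean) - (c < mean)) for b, c in zip(bias, token_counts)]


weights = route_token([2.0, 0.5, -1.0, 1.0], bias=[0.0, 0.0, 0.0, 0.0])
print(weights)
print(update_bias([0.0, 0.0, 0.0, 0.0], token_counts=[10, 2, 4, 4]))
```

Because the bias is updated from token counts rather than from a loss term, no gradient flows through the balancing signal, which matches the "no auxiliary loss" bullet.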
3. Hyper-Connections with Sinkhorn Routing - DeepSeek V4
- Replaces standard residual connections with learned routing
- 4 connection types: identity, transform, gate, skip
- Sinkhorn-Knopp normalization (2 iterations) for doubly-stochastic weights
- Input-dependent routing via gating network
- Type-specific learnable biases initialized per config
- Pre-LayerNorm on layer output before routing
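As a rough illustration of the Sinkhorn-Knopp step, the sketch below alternately normalises rows and columns of a square matrix of routing logits. The 4×4 shape (one row and column per connection type) and the function name are guesses for illustration only; the card does not say what the routing matrix looks like.

```python
import math


def sinkhorn(logits, n_iters=2):
    """Alternate row and column normalisation of exp(logits).

    With few iterations the result is only approximately doubly
    stochastic: columns sum to 1 exactly, rows only approximately.
    """
    peak = max(max(row) for row in logits)  # shift for numerical stability
    m = [[math.exp(x - peak) for x in row] for row in logits]
    size = len(m)
    for _ in range(n_iters):
        for i in range(size):               # make each row sum to 1
            row_sum = sum(m[i])
            m[i] = [x / row_sum for x in m[i]]
        for j in range(size):               # make each column sum to 1
            col_sum = sum(m[i][j] for i in range(size))
            for i in range(size):
                m[i][j] /= col_sum
    return m


routing_logits = [
    [0.0, 1.0, 0.5, -0.5],
    [0.2, 0.1, 0.0, 0.3],
    [1.0, -1.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],
]
for row in sinkhorn(routing_logits, n_iters=2):
    print([round(x, 3) for x in row])
```

Two iterations trade exactness for speed; running more iterations drives the row sums closer to 1 as well.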
4. Hybrid Attention - MiniMax M2.7
- 3 softmax layers (indices 0, 4, 8): Standard MLA with full causal attention
- 7 linear layers (1, 2, 3, 5, 6, 7, 9): MLA + LightningAttentionV2 50/50 blend
- LightningAttentionV2: O(n) complexity with intra-block softmax + inter-block KV product
- Incremental KV state for efficient autoregressive generation
- ReLU/Swish activation replaces softmax in linear path
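The incremental KV state can be illustrated with the plain recurrent form of linear attention. This is not LightningAttentionV2 itself: it omits the block-wise intra/inter split, any decay, and any output normalisation, and it assumes a ReLU feature map. It only shows why each generated token costs constant work once a running state is kept.

```python
def relu(x):
    return [max(v, 0.0) for v in x]


def linear_attention_step(state, q, k, v):
    """Process one token and return its attention output.

    state is a d_k x d_v matrix holding the running sum of relu(k) v^T.
    It is updated in place, so cost per token does not grow with length.
    """
    fq, fk = relu(q), relu(k)
    for i, ki in enumerate(fk):
        for j, vj in enumerate(v):
            state[i][j] += ki * vj
    return [sum(fq[i] * state[i][j] for i in range(len(fq))) for j in range(len(v))]


# Toy sequence of three tokens with 2-dim queries, keys, and values.
qs = [[1.0, -2.0], [0.5, 1.0], [2.0, 0.0]]
ks = [[1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
vs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

state = [[0.0, 0.0], [0.0, 0.0]]
outputs = [linear_attention_step(state, q, k, v) for q, k, v in zip(qs, ks, vs)]
print(outputs)
```

The outputs match the quadratic causal form (summing `relu(q_t)·relu(k_s) * v_s` over all `s <= t`), which is what makes the recurrent version a drop-in for generation.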
5. Multi-Token Prediction (MTP) - DeepSeek V4
- 2 MTP layers, each predicting 2 tokens ahead (mtp_depth=2)
- Projection block: Linear → LayerNorm → GELU → Linear + residual skip
- RoPE positional encoding on reduced dim (hidden/4) for efficiency
- Tied LM head shares parameters with main embedding layer
- Chunked computation (chunk_size=16) to avoid full (B, S, V) logits
- Loss weight: 0.3 × cross-entropy of future token predictions
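The chunked loss can be sketched in plain Python for a single sequence. Only `chunk_size` logits rows exist at any time, instead of one row per position. The function names are hypothetical, and adding the weighted MTP term to the main loss is an assumption about how the 0.3 weight is applied.

```python
import math

MTP_WEIGHT = 0.3  # from the list above


def chunked_cross_entropy(hidden, targets, embed, chunk_size=16):
    """Mean cross-entropy computed chunk by chunk.

    hidden:  S vectors of size H (outputs of an MTP head)
    targets: S future-token ids
    embed:   V vectors of size H (tied embedding reused as the LM head)
    """
    total = 0.0
    for start in range(0, len(hidden), chunk_size):
        chunk = hidden[start:start + chunk_size]
        # Logits for this chunk only: at most chunk_size x V values.
        chunk_logits = [[sum(a * b for a, b in zip(h, row)) for row in embed] for h in chunk]
        for logits, target in zip(chunk_logits, targets[start:start + chunk_size]):
            peak = max(logits)
            log_z = peak + math.log(sum(math.exp(z - peak) for z in logits))
            total += log_z - logits[target]
    return total / len(hidden)


def combined_loss(main_loss, mtp_loss):
    # Assumption: the auxiliary MTP loss is added to the main loss.
    return main_loss + MTP_WEIGHT * mtp_loss


embed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5], [0.0, 0.0]]
hidden = [[0.1 * i, -0.05 * i] for i in range(40)]
targets = [i % 5 for i in range(40)]
mtp_loss = chunked_cross_entropy(hidden, targets, embed)
print(combined_loss(main_loss=2.0, mtp_loss=mtp_loss))
```

The result does not depend on `chunk_size`; the parameter only bounds peak memory.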
6. Self-Evolution Framework - MiniMax M2.7 / Deeplm
- Autonomous 8-phase research loop: hypothesis → design → execute → analyze → diagnose → fix → evaluate → decide
- 100+ autonomous optimization rounds per training cycle
- 3 feedback chain episodes for meta-learning
7. AutoTuner - Deeplm custom
- Energy-based adaptive hyperparameter controller
- Phase-aware dynamics (warmup → exploration → balanced → exploitation)
- Bayesian dynamics model: uncertainty-aware lr/wd sensitivity (Welford variance)
- Multi-timescale loss EMAs (short=0.9, med=0.98, long=0.995)
- Gradient noise scale monitoring
- Cosine similarity for gradient direction tracking
- Layer health monitoring with per-group gradient ratios
- Failure-aware rollback with revive mechanism
- Strategic planner: multi-step scheduled adjustments with plan accuracy tracking
- Dual-window trajectory predictor: regime change detection, convergence estimation
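Two of the building blocks named above, multi-timescale loss EMAs and Welford variance, are small enough to sketch directly. The class names are hypothetical, and reading 0.9, 0.98, and 0.995 as EMA decay coefficients is an assumption; the card lists the numbers without saying how they are used.

```python
class LossEMAs:
    """Exponential moving averages of the loss at three timescales."""

    DECAYS = {"short": 0.9, "med": 0.98, "long": 0.995}  # assumed to be decay factors

    def __init__(self):
        self.values = {}

    def update(self, loss):
        for name, decay in self.DECAYS.items():
            prev = self.values.get(name, loss)  # seed with the first observation
            self.values[name] = decay * prev + (1.0 - decay) * loss
        return dict(self.values)


class Welford:
    """Streaming mean and sample variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


emas = LossEMAs()
stats = Welford()
for loss in [10.0, 20.0, 12.0, 11.0]:
    snapshot = emas.update(loss)
    stats.update(loss)
print(snapshot, stats.mean, stats.variance)
```

When the three EMAs agree, as in the short=med=long reading reported for step 19,500, the loss has been flat across all three horizons; a short EMA pulling away from the long one would signal a recent regime change.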
Training Configuration
| Config | Value |
|---|---|
| Dataset | Wikipedia-id (Indonesian) + GLM-5.1 (English reasoning) + English Wikipedia |
| Tokenizer | 32K BBPE |
| Optimizer | SGD Nesterov (momentum=0.9, weight_decay=0.1) |
| LR Schedule | Cosine (warmup 3%) |
| Base LR | 3e-4 |
| Effective Batch | 36 (12 × 3 grad_accum) |
| Sequence Length | 2048 |
| Max Grad Norm | 1.0 (auto-tuned) |
| Total Steps | 19,500 |
| GPU | A10G (24GB) |
| Dtype | float32 |
| Curriculum | 4-tier (easy → medium → hard → reasoning), current: hard |
| Dynamic Mix | Adaptive per-category sampling weights, applied via WeightedBucketSampler |
| Tokenization | Disk-cached (SHA-256 keyed), no re-tokenization per epoch |
| Filtering | StrictFilter: URL/HTML/emoji stripping + char ratio + language score + repetition + min words |
| Batching | BucketDataset: groups by length for efficient padding |
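The LR Schedule and Base LR rows describe a cosine schedule with 3% warmup. A minimal sketch follows, assuming linear warmup and a decay floor of zero. The function name, `total_steps`, and `min_lr` are placeholders: the card does not state the schedule horizon or floor, and the logged LR also includes the AutoTuner's multiplier, so this sketch is not expected to reproduce the reported values.

```python
import math


def lr_at(step, total_steps, base_lr=3e-4, warmup_frac=0.03, min_lr=0.0):
    """Linear warmup followed by cosine decay (assumed shape)."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at the floor after the horizon
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Illustrative horizon only; not the horizon used for this run.
for step in (0, 29, 30, 515, 1000):
    print(step, lr_at(step, total_steps=1000))
```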
Training Algorithms
| Algorithm | Status | Description |
|---|---|---|
| Curriculum Learning | Active | 4-tier easy → hard progression by text length |
| Dynamic Sampling | Active | Adaptive category mix based on per-category loss |
| Difficulty Scheduling | Active | 4 phases: Token Learning → Syntax → Reasoning → Expert |
| MoE Balancing | Active | Bias-based load-balanced routing |
| AutoTuner | Active | AI adaptive hyperparameter control |
| MTP | Active | Auxiliary multi-token prediction loss |
| Curriculum Scheduling | Active | Loss-based adaptive difficulty |
| Reflection Training | Active | High-loss example replay (1,500 stored) |
| Memory Algorithms | Active | 1,500 stored, avg loss 10.1 |
| Tool Routing | Active | Code=706, Math=205, Formal=587 routed |
| Synthetic Evolution | Inactive | Model-generated training data (potential A10G bottleneck) |
AutoTuner State (Step 19,500)
| Metric | Step 18,000 | Step 19,500 | Change |
|---|---|---|---|
| Phase | Balanced | Exploitation | ↑ aggressiveness |
| LR Multiplier | 0.78× | 0.64× | ↓ 18% |
| Grad Norm Multiplier | 0.76× | 0.64× | ↓ 16% |
| Weight Decay Mult | 1.60× | active | regularization |
| Best (smoothed loss) | 28.84 | 4.01 | - (different scale) |
| Best Eval Loss | 56.07 | 56.07 | - (no new eval) |
| Adjustments Made | - | 152 | learned control |
| Degeneracy Reductions | - | 2 | prevented divergence |
| Cosine Similarity EMA | - | 0.15 | moderate direction stability |
| Gradient Noise EMA | - | 0.10 | low noise |
| Gradient Norm (avg) | - | 3.45 | well-controlled |
| Diagnosis | Overfitting | Exploitation | phase-consistent |
| Plan Strategy | Regularize | regularization ongoing | - |
| Plan Accuracy | 0.04 | - | exploratory phase |
| Trajectory Slope | +1.85 (r²=0.08) | - | high variance |
| Mix Weights | short=5.6%, med=40.8%, long=30.4%, vlong=23.2% | short=43.3%, med=24.5%, long=18.3%, vlong=13.9% | shifted to short |
| Curriculum | medium | hard | ↑ difficulty |
The AutoTuner has entered the exploitation phase at step 19,500: it reduces the LR to 6.35e-5 (LR multiplier 0.64×), scales gradient clipping by 0.64×, and increases weight decay for regularization. The multi-timescale EMAs (short=10.3, med=10.3, long=10.3) indicate stable convergence in the underlying dynamics, even though curriculum tier transitions cause high variance in the surface loss.
Routing Activity (Step 19,500)
| Route | Count | Avg Performance |
|---|---|---|
| Code | 706 | 10.36 |
| Math | 205 | 10.29 |
| Formal | 587 | 10.36 |
| Creative | 1 | 10.83 |
| Dialog | 1 | 9.48 |
Routing algorithms are actively classifying training examples by type, with code and formal reasoning dominating the mix.
Data Pipeline (New in v2)
- StrictFilter: Multi-layer text quality filter: URL/HTML/emoji stripping → char ratio ≥0.25 → language score ≥0.001 → 4-gram repetition ≤0.4 → min 10 words
- TokenCache: SHA-256 keyed disk cache; tokenize once per unique text, no re-tokenization across epochs
- BucketDataset: Groups texts by similar length (bucket_size=64) to minimize padding waste
- WeightedBucketSampler: Importance sampling by category weights, synced from DynamicSampler every 500 steps
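Two of these stages can be sketched with the standard library alone: a SHA-256 cache key and the last two filter checks. The function names are hypothetical, and defining 4-gram repetition as the fraction of duplicated word 4-grams is an assumption; the card gives the thresholds but not the exact metric. The character-ratio and language-score checks are left out because their definitions are not given.

```python
import hashlib


def cache_key(text: str) -> str:
    """Stable key for a tokenization cache: same text, same key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def four_gram_repetition(words):
    """Fraction of word 4-grams that duplicate an earlier one (assumed metric)."""
    grams = [tuple(words[i:i + 4]) for i in range(len(words) - 3)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


def passes_filter(text, min_words=10, max_repetition=0.4):
    """Apply the min-words and 4-gram repetition thresholds from the list above."""
    words = text.split()
    if len(words) < min_words:
        return False
    return four_gram_repetition(words) <= max_repetition


samples = [
    "satu dua tiga",
    "ini adalah contoh kalimat yang cukup panjang untuk lolos dari semua penyaring teks",
    "a b c d " * 10,
]
for text in samples:
    print(passes_filter(text), cache_key(text)[:12])
```

Keying the cache on the text hash rather than on a dataset index means a text is tokenized once even if it appears in several categories or epochs.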
Files
| File | Description |
|---|---|
| `model.pt` | Model weights (~105M params, 419MB), step 19,500 |
| `best.pt` | Best checkpoint by eval loss |
| `training_state.json` | Full training state including AutoTuner state |
| `tokenizer.json` | BBPE tokenizer (32K vocab) |
| `tokenizer_config.json` | Tokenizer configuration |
| `config.yaml` | Model configuration (DeeplmConfig defaults) |
| `training_curve_8k_20k.png` | Updated training curves: step 8,010 → 19,690 |
Usage
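```python
import torch

from deeplm.config import DeeplmConfig
from deeplm.model.deeplm import DeeplmModel

# Build the model from the default config and load the step-19,500 weights.
config = DeeplmConfig()
model = DeeplmModel(config)
# strict=False tolerates missing or unexpected keys, so mismatches are not reported.
model.load_state_dict(torch.load("model.pt", map_location="cpu"), strict=False)
model.eval()

# Placeholder token ids; encode real text with tokenizer.json in practice.
input_ids = torch.tensor([[1, 2, 3]])
output = model.generate(
    input_ids,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.9,
)
print(output)
```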