Deeplm-105M v2 (Step 19,500)

An Indonesian language model with a novel architecture that combines MLA, MoE, Hyper-Connections, Hybrid Attention, Multi-Token Prediction, Self-Evolution, and an autonomous AutoTuner.

Trained on an A10G (24GB) for 19,500 steps (~24h) with a progressive curriculum, dynamic category sampling, the reflection, memory, and routing algorithms enabled, and energy-based hyperparameter control.

Training Progress (Step 8,000 → 19,500)

Metric Step 8,000 Step 18,000 Step 19,500 Delta (8k→19.5k)
Loss (range) 59.6 ± 31.2 29.3 – 83.2 31.29 (at step) n/a
Mean Loss (8k+) n/a 53.5 50.68 -2.8
Best Loss (eval) n/a 56.07 56.07 n/a
Curriculum balanced medium hard ↑
Learning Rate 3.84e-04 9.80e-05 6.47e-05 6.0× ↓
Gradient Norm (avg) 21.3 16.8 15.4 -5.9
Throughput 262 tok/s 262 tok/s 249 tok/s -5%
Total Tokens n/a ~263K ~381K +45%

Training Curves

4-panel training curves: loss (with 20-step moving average), learning rate (cosine, log scale), gradient norm (MA-30, log scale), and throughput tokens/s (MA-50). Green = previous upload at step 18,000, Red = current upload at step 19,500. Data logged every 10 steps from step 8,010 to 19,690.

Key Observations (8k → 19.5k)

  • Curriculum progression: balanced → medium → hard; loss variance reflects tier transitions
  • Loss range: 28.84 – 85.48 (mean 50.68), reflecting diverse curriculum tiers (easy → hard reasoning)
  • Best eval loss: 56.07 held steady (no new eval between steps 9,500 and 19,500)
  • LR decay: Cosine schedule from 2.95e-04 → 6.29e-05 at step 19,690
  • AutoTuner: Phase changed from balanced → exploitation; it is now lowering the LR and raising weight decay for regularization
  • Reflection + Memory + Routing: Activated after step 15,000; adds overhead (249 tok/s vs 262)
  • Gradient norm: Stable at 5–20 range with fewer extreme spikes as training progresses

Architecture

Component Detail
Total Parameters 104,747,048 (~105M)
Vocabulary 32,000 (BBPE)
Layers 10 Transformer blocks
Hidden Size 512
Feed-Forward 2048 (SwiGLU, 4× hidden)
Attention Heads 8 query heads, 1 KV head (MQA)
Head Dim 128 (64 RoPE + 64 NoPE)
Max Seq Length 4096
RoPE Theta 50,000
Attention MLA (Multi-head Latent Attention)
FFN MoE (4 routed + 1 shared experts, top-k=2)
Residual Hyper-Connections with Sinkhorn routing
Hybrid Attention 3 softmax + 7 Lightning layers
Prediction MTP (Multi-Token Prediction, depth=2, 2 MTP layers)
Self-Evolution Autonomous research loop (100+ rounds)
Embeddings Tied (shared between input/output)
AutoTuner Adaptive energy-based optimizer scheduler
Dtype float32 (Hyper-Connections stability)

Key Innovations


1. Multi-head Latent Attention (MLA) - DeepSeek V4 / Kimi K2.6

  • Q compressed: hidden → q_lora_rank(192) → LayerNorm → q_up(8 × 128)
  • KV compressed: hidden → [kv_latent(64) + k_rope(64)] → kv_up → [k_nope(64) + v(128)] × 8 heads
  • Entire KV cache per token: just 128 dims (64 latent + 64 rope), ~8× smaller than standard MHA
  • Decoupled RoPE applied only to 64-dim k_pe, content path stays RoPE-free
  • Absorption trick pre-computes W_UK @ W_UV for faster inference
  • MQA-style: KV decomposed once, expanded to all query heads
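
The cache figures above can be checked with a few lines of arithmetic. This is a sketch, not the model's code. The 128-value cache, 8 heads, 128-dim heads, 10 layers, float32 and the 4096-token maximum come from this card. The "standard MHA" baseline (full per-head keys and values cached for all 8 heads) is an assumption.

```python
# Per-token, per-layer cache sizes counted in scalar values (not bytes).
KV_LATENT = 64        # compressed KV latent, from the list above
K_ROPE = 64           # decoupled RoPE key, from the list above
N_HEADS = 8           # query heads
HEAD_DIM = 128        # per-head dimension
N_LAYERS = 10
BYTES_PER_VALUE = 4   # float32, as stated in the architecture table
SEQ_LEN = 4096        # maximum sequence length

mla_per_token = KV_LATENT + K_ROPE         # what MLA caches
mha_keys_per_token = N_HEADS * HEAD_DIM    # assumed baseline: keys only
mha_kv_per_token = 2 * mha_keys_per_token  # assumed baseline: keys + values


def cache_bytes(values_per_token: int) -> int:
    """Cache size for one full-length sequence across all layers."""
    return values_per_token * N_LAYERS * SEQ_LEN * BYTES_PER_VALUE


print("MLA values per token:", mla_per_token)
print("ratio vs MHA keys only:", mha_keys_per_token / mla_per_token)
print("ratio vs MHA keys + values:", mha_kv_per_token / mla_per_token)
print("MLA cache at 4096 tokens (MiB):", cache_bytes(mla_per_token) / 2**20)
```

Under these assumptions the ~8× figure corresponds to comparing against one cached tensor (keys or values) of a standard MHA layer; counting keys and values together doubles the ratio.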

2. Mixture of Experts (MoE) - DeepSeek V4 / Kimi K2.6

  • 4 routed experts + 1 shared expert (always active, Kimi K2.6 style)
  • Top-k=2 routing: each token activates only 2 of the 4 routed experts, in addition to the always-active shared expert
  • sqrt(softplus(x)) scoring for numerical stability (DeepSeek V4)
  • Bias-based load balancing (no auxiliary loss, no gradient interference)
  • Per-expert routing bias auto-updates to balance token assignments
  • SwiGLU activation in every expert (fused gate+up projection)
  • Expert affinity memory tracks token-expert history
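
The routing described above can be sketched in plain Python. This is a minimal illustration, not the Deeplm implementation. The function names, the 0.01 bias step and the sign-based update rule are assumptions, as is the choice to use the bias only for expert selection while taking gate weights from the unbiased scores.

```python
import math

NUM_EXPERTS = 4
TOP_K = 2


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def route(logits, bias, top_k=TOP_K):
    """Return [(expert_index, gate_weight), ...] for one token."""
    scores = [math.sqrt(softplus(z)) for z in logits]
    # The bias only influences which experts are selected...
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:top_k]
    # ...while the gate weights are the unbiased scores, normalised to sum to 1.
    total = sum(scores[e] for e in chosen)
    return [(e, scores[e] / total) for e in chosen]


def update_bias(bias, counts, step=0.01):
    """Nudge under-used experts up and over-used experts down (no auxiliary loss)."""
    mean = sum(counts) / len(counts)
    return [b + step * ((c < mean) - (c > mean)) for b, c in zip(bias, counts)]


# Hypothetical router logits for three tokens.
token_logits = [[2.0, 1.0, -1.0, 0.5], [1.5, 0.2, 0.1, -0.3], [3.0, 0.0, 0.4, 0.1]]
bias = [0.0] * NUM_EXPERTS
counts = [0] * NUM_EXPERTS
for logits in token_logits:
    for expert, _gate in route(logits, bias):
        counts[expert] += 1
bias = update_bias(bias, counts)
print(counts, bias)
```

Because the bias is adjusted outside of backpropagation, balancing in this sketch adds no gradient term to the language-modelling loss, which is the property the list attributes to bias-based load balancing.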

3. Hyper-Connections with Sinkhorn Routing - DeepSeek V4

  • Replaces standard residual connections with learned routing
  • 4 connection types: identity, transform, gate, skip
  • Sinkhorn-Knopp normalization (2 iterations) for doubly-stochastic weights
  • Input-dependent routing via gating network
  • Type-specific learnable biases initialized per config
  • Pre-LayerNorm on layer output before routing

4. Hybrid Attention - MiniMax M2.7

  • 3 softmax layers (indices 0, 4, 8): Standard MLA with full causal attention
  • 7 linear layers (1, 2, 3, 5, 6, 7, 9): MLA + LightningAttentionV2 50/50 blend
  • LightningAttentionV2: O(n) complexity with intra-block softmax + inter-block KV product
  • Incremental KV state for efficient autoregressive generation
  • ReLU/Swish activation replaces softmax in linear path

5. Multi-Token Prediction (MTP) - DeepSeek V4

  • 2 MTP layers, each predicting 2 tokens ahead (mtp_depth=2)
  • Projection block: Linear → LayerNorm → GELU → Linear + residual skip
  • RoPE positional encoding on reduced dim (hidden/4) for efficiency
  • Tied LM head shares parameters with main embedding layer
  • Chunked computation (chunk_size=16) to avoid full (B, S, V) logits
  • Loss weight: 0.3 × cross-entropy of future token predictions
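
A small sketch of how the auxiliary targets and the weighted loss might fit together. The 0.3 weight comes from the list above. The helper names, the averaging over MTP layers and the example loss values are assumptions made for illustration.

```python
MTP_WEIGHT = 0.3  # from the list above


def shifted_targets(tokens, offset):
    """Pair each position t with the token `offset` steps ahead of it."""
    return [(t, tokens[t + offset]) for t in range(len(tokens) - offset)]


def combined_loss(main_ce, mtp_ces, weight=MTP_WEIGHT):
    """Main next-token loss plus a weighted mean of the auxiliary MTP losses."""
    return main_ce + weight * sum(mtp_ces) / len(mtp_ces)


tokens = [11, 12, 13, 14, 15]
next_token = shifted_targets(tokens, 1)  # main head: position t predicts t+1
two_ahead = shifted_targets(tokens, 2)   # MTP head: position t predicts t+2

# Made-up cross-entropy values, only to show the weighting.
loss = combined_loss(main_ce=4.0, mtp_ces=[5.0, 6.0])
print(next_token, two_ahead, loss)
```

Predicting further ahead shortens the usable sequence by the offset, so the two-ahead head has one fewer training position than the main head.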

6. Self-Evolution Framework - MiniMax M2.7 / Deeplm

  • Autonomous 8-phase research loop: hypothesis → design → execute → analyze → diagnose → fix → evaluate → decide
  • 100+ autonomous optimization rounds per training cycle
  • 3 feedback chain episodes for meta-learning

7. AutoTuner - Deeplm custom

  • Energy-based adaptive hyperparameter controller
  • Phase-aware dynamics (warmup → exploration → balanced → exploitation)
  • Bayesian dynamics model: uncertainty-aware lr/wd sensitivity (Welford variance)
  • Multi-timescale loss EMAs (short=0.9, med=0.98, long=0.995)
  • Gradient noise scale monitoring
  • Cosine similarity for gradient direction tracking
  • Layer health monitoring with per-group gradient ratios
  • Failure-aware rollback with revive mechanism
  • Strategic planner: multi-step scheduled adjustments with plan accuracy tracking
  • Dual-window trajectory predictor: regime change detection, convergence estimation
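
Two of the building blocks named above, multi-timescale loss EMAs and Welford variance, are standard and can be sketched without reference to the AutoTuner's internals. The class names and the sample losses are hypothetical. Only the decay values (0.9, 0.98, 0.995) come from the list.

```python
class MultiScaleEMA:
    """Track one exponential moving average per decay rate."""

    def __init__(self, decays=(0.9, 0.98, 0.995)):
        self.decays = decays
        self.values = None

    def update(self, loss):
        if self.values is None:
            # Initialise every timescale with the first observation.
            self.values = [loss] * len(self.decays)
        else:
            self.values = [d * v + (1 - d) * loss for d, v in zip(self.decays, self.values)]
        return self.values


class Welford:
    """Single-pass running mean and sample variance."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


ema = MultiScaleEMA()
stats = Welford()
for loss in [60.0, 50.0, 40.0, 30.0]:  # made-up falling losses
    short, med, long_ = ema.update(loss)
    stats.update(loss)
print(short, med, long_, stats.mean, stats.variance)
```

When the loss is falling, the short EMA drops below the medium and long ones; a controller could read that ordering as a trend signal, though how the AutoTuner actually uses it is not described in this card.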

Training Configuration

Config Value
Dataset Wikipedia-id (Indonesian) + GLM-5.1 (English reasoning) + English Wikipedia
Tokenizer 32K BBPE
Optimizer SGD Nesterov (momentum=0.9, weight_decay=0.1)
LR Schedule Cosine (warmup 3%)
Base LR 3e-4
Effective Batch 36 (12 × 3 grad_accum)
Sequence Length 2048
Max Grad Norm 1.0 (auto-tuned)
Total Steps 19,500
GPU A10G (24GB)
Dtype float32
Curriculum 4-tier (easy → medium → hard → reasoning), current: hard
Dynamic Mix Adaptive per-category sampling weights, applied via WeightedBucketSampler
Tokenization Disk-cached (SHA-256 keyed), no re-tokenization per epoch
Filtering StrictFilter: URL/HTML/emoji stripping + char ratio + language score + repetition + min words
Batching BucketDataset: groups by length for efficient padding
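
For reference, a cosine schedule with a 3% warmup and a base LR of 3e-4 can be written as below. This is a generic sketch: the linear warmup shape, the `min_lr` floor, the function name and the `total_steps` horizon are assumptions, since the card does not state them. The learning rates logged in this card presumably also include the AutoTuner's LR multiplier, so this sketch is not expected to reproduce them.

```python
import math


def lr_at(step, total_steps, base_lr=3e-4, warmup_frac=0.03, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


TOTAL = 1000  # hypothetical horizon, for illustration only
for step in (0, 29, 30, 500, 1000):
    print(step, lr_at(step, TOTAL))
```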

Training Algorithms

Algorithm Status Description
Curriculum Learning Active 4-tier easy→hard progression by text length
Dynamic Sampling Active Adaptive category mix based on per-category loss
Difficulty Scheduling Active 4 phases: Token Learning → Syntax → Reasoning → Expert
MoE Balancing Active Bias-based load-balanced routing
AutoTuner Active Adaptive, energy-based hyperparameter control
MTP Active Auxiliary multi-token prediction loss
Curriculum Scheduling Active Loss-based adaptive difficulty
Reflection Training Active High-loss example replay (1,500 stored)
Memory Algorithms Active 1,500 stored, avg loss 10.1
Tool Routing Active Code=706, Math=205, Formal=587 routed
Synthetic Evolution Inactive Model-generated training data (potential A10G bottleneck)
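
High-loss example replay can be sketched as a bounded buffer that keeps only the hardest examples seen so far. The class and method names are hypothetical and the eviction policy (drop the lowest-loss entry when full) is an assumption. Only the capacity of 1,500 comes from the table above.

```python
import heapq


class ReflectionBuffer:
    """Keep the `capacity` highest-loss examples seen so far."""

    def __init__(self, capacity=1500):
        self.capacity = capacity
        self._heap = []    # min-heap of (loss, insertion_id, example)
        self._next_id = 0  # unique tie-breaker so examples are never compared

    def add(self, loss, example):
        item = (loss, self._next_id, example)
        self._next_id += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif loss > self._heap[0][0]:
            # Evict the current lowest-loss entry in favour of a harder one.
            heapq.heapreplace(self._heap, item)

    def hardest(self, n):
        """Return the n highest-loss examples, hardest first."""
        return [example for _, _, example in heapq.nlargest(n, self._heap)]

    def __len__(self):
        return len(self._heap)


buffer = ReflectionBuffer(capacity=3)
for i, loss in enumerate([1.0, 2.0, 3.0, 4.0, 5.0], start=1):
    buffer.add(loss, f"example-{i}")
print(len(buffer), buffer.hardest(2))
```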

AutoTuner State (Step 19,500)

Metric Step 18,000 Step 19,500 Change
Phase Balanced Exploitation ↑ aggressiveness
LR Multiplier 0.78× 0.64× ↓ 18%
Grad Norm Multiplier 0.76× 0.64× ↓ 16%
Weight Decay Mult 1.60× active regularization
Best (smoothed loss) 28.84 4.01 ↓ (different scale)
Best Eval Loss 56.07 56.07 n/a (no new eval)
Adjustments Made n/a 152 learned control
Degeneracy Reductions n/a 2 prevented divergence
Cosine Similarity EMA n/a 0.15 moderate direction stability
Gradient Noise EMA n/a 0.10 low noise
Gradient Norm (avg) n/a 3.45 well-controlled
Diagnosis Overfitting Exploitation phase-consistent
Plan Strategy Regularize regularization ongoing n/a
Plan Accuracy 0.04 n/a exploratory phase
Trajectory Slope +1.85 (r²=0.08) n/a high variance
Mix Weights short=5.6%, med=40.8%, long=30.4%, vlong=23.2% short=43.3%, med=24.5%, long=18.3%, vlong=13.9% shifted to short
Curriculum medium hard ↑ difficulty

The AutoTuner has entered the exploitation phase at step 19,500, reducing the LR to 6.35e-5 (0.64× base) and the gradient clip to 0.64×, and increasing weight decay for regularization. The multi-timescale EMAs (short=10.3, med=10.3, long=10.3) indicate stable convergence at the level of the underlying dynamics, even though curriculum tier transitions cause high variance in the surface loss.

Routing Activity (Step 19,500)

Route Count Avg Performance
Code 706 10.36
Math 205 10.29
Formal 587 10.36
Creative 1 10.83
Dialog 1 9.48

Routing algorithms are actively classifying training examples by type, with code and formal reasoning dominating the mix.

Data Pipeline (New in v2)

  • StrictFilter: Multi-layer text quality filter (URL/HTML/emoji stripping → char ratio ≥ 0.25 → language score ≥ 0.001 → 4-gram repetition ≤ 0.4 → min 10 words)
  • TokenCache: SHA-256 keyed disk cache; tokenize once per unique text, no re-tokenization across epochs
  • BucketDataset: Groups texts by similar length (bucket_size=64) to minimize padding waste
  • WeightedBucketSampler: Importance sampling by category weights, synced from DynamicSampler every 500 steps

Files

File Description
model.pt Model weights (~105M params, 419MB), step 19,500
best.pt Best checkpoint by eval loss
training_state.json Full training state including AutoTuner state
tokenizer.json BBPE tokenizer (32K vocab)
tokenizer_config.json Tokenizer configuration
config.yaml Model configuration (DeeplmConfig defaults)
training_curve_8k_20k.png Updated training curves: step 8,010 → 19,690

Usage

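import torch
from deeplm.config import DeeplmConfig
from deeplm.model.deeplm import DeeplmModel

# Build the model from the default configuration and load the step-19,500 weights.
config = DeeplmConfig()
model = DeeplmModel(config)
state_dict = torch.load("model.pt", map_location="cpu")
# strict=False tolerates key mismatches, so report them instead of hiding them.
load_result = model.load_state_dict(state_dict, strict=False)
print("missing keys:", load_result.missing_keys)
print("unexpected keys:", load_result.unexpected_keys)
model.eval()

# Placeholder token IDs; in practice, encode a prompt with the tokenizer in tokenizer.json.
input_ids = torch.tensor([[1, 2, 3]])
with torch.no_grad():  # inference only, no gradients needed
    output = model.generate(
        input_ids,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.9,
    )
print(output)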