arxiv:2602.19580

Leap+Verify: Regime-Adaptive Speculative Weight Prediction for Accelerating Neural Network Training

Published on Feb 23

Abstract

Leap+Verify accelerates neural network training through speculative execution by predicting future model weights and validating them in dynamically detected training regimes, showing that finite-difference predictors outperform momentum-based ones while regime distribution varies significantly between model sizes.

AI-generated summary

We introduce Leap+Verify, a framework that applies speculative execution (predicting future model weights and validating predictions before acceptance) to accelerate neural network training. Inspired by speculative decoding in language model inference and by the Automatically Scalable Computation (ASC) architecture for program execution, Leap+Verify decomposes training into three dynamically detected regimes (chaotic, transition, stable) using activation-space cosine similarity as a real-time Lyapunov proxy signal. Within each regime, analytic weight predictors (momentum, linear, and quadratic extrapolation) attempt to forecast model parameters K training steps ahead; predictions are accepted only when validated against a held-out loss criterion. We evaluate Leap+Verify on GPT-2 124M and Qwen 2.5-1.5B trained on WikiText-103 across five random seeds, sweeping prediction depth K in {5, 10, 25, 50, 75, 100}. Momentum-based prediction (Adam moment extrapolation) fails catastrophically at both scales, with predicted losses exceeding actuals by 100-10,000x, a universal norm explosion in optimizer-state extrapolation. Finite-difference predictors (linear, quadratic) succeed where momentum fails: at 124M, they achieve 24% strict acceptance at K=5 in stable regimes; at 1.5B, they achieve 37% strict acceptance in transition regimes. The scale-dependent finding is in regime distribution: GPT-2 124M spends 34% of training in the stable regime, while Qwen 2.5-1.5B spends 64% in the chaotic regime and reaches the stable regime in only 0-2 of 40 checkpoints. Larger models are more predictable when predictable, but less often predictable: the practical bottleneck shifts from predictor accuracy to regime availability. Cross-seed results are highly consistent (less than 1% validation loss variance), and the three-regime framework produces phase boundaries that are identical to within ±50 steps across seeds.
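
The abstract names the moving parts (a cosine-similarity regime signal, finite-difference predictors, and a held-out loss check) without giving formulas, so the following is only a minimal sketch of how they could fit together, not the authors' implementation. Everything specific here is an assumption: the function names, the regime thresholds `chaotic_below` and `stable_above`, the use of weight snapshots taken one training step apart, and the acceptance rule (keep the leap only if its held-out loss does not exceed that of the current weights by more than `tolerance`). Weights are plain Python lists to keep the example dependency-free.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two activation vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def classify_regime(similarity, chaotic_below=0.90, stable_above=0.99):
    """Map a similarity score to a regime label.

    The two thresholds are hypothetical placeholders, not values from the paper.
    """
    if similarity < chaotic_below:
        return "chaotic"
    if similarity >= stable_above:
        return "stable"
    return "transition"

def linear_leap(w_prev, w_curr, k):
    """Linear extrapolation k steps ahead from two snapshots one step apart."""
    return [c + k * (c - p) for p, c in zip(w_prev, w_curr)]

def quadratic_leap(w_prev2, w_prev, w_curr, k):
    """Quadratic extrapolation k steps ahead from three snapshots one step apart.

    Uses the Newton backward-difference form: w + k*d1 + k*(k+1)/2 * d2.
    """
    out = []
    for a, b, c in zip(w_prev2, w_prev, w_curr):
        d1 = c - b            # first backward difference
        d2 = c - 2 * b + a    # second backward difference
        out.append(c + k * d1 + 0.5 * k * (k + 1) * d2)
    return out

def leap_and_verify(candidate, fallback, held_out_loss, tolerance=0.0):
    """Return (weights, accepted): keep the candidate only if it passes the check.

    Assumed criterion: candidate loss <= fallback loss + tolerance.
    """
    accepted = held_out_loss(candidate) <= held_out_loss(fallback) + tolerance
    return (candidate if accepted else fallback), accepted

# Toy usage: a sum-of-squares "loss" stands in for a real held-out evaluation.
toy_loss = lambda w: sum(x * x for x in w)
w_prev, w_curr = [1.0, 2.0], [0.9, 1.8]

regime = classify_regime(cosine_similarity([0.5, 0.5], [0.5, 0.5]))
for k in (5, 50):
    candidate = linear_leap(w_prev, w_curr, k)
    _, accepted = leap_and_verify(candidate, w_curr, toy_loss)
    print(regime, k, accepted)
```

In this toy setting the short leap moves toward the minimum and passes the check, while the long leap overshoots it and is rejected, which is the role the verification step plays in the abstract's description. A real setup would replace `toy_loss` with an evaluation on held-out data and operate on model parameter tensors rather than lists.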
