arxiv:2606.04048

Unlocking Feature Learning in Gated Delta Networks at Scale

Published on Jun 2

Submitted by Yifeng Liu on Jun 4

University of California, Los Angeles

Authors:

Yifeng Liu

Abstract

Scaling rules for Gated Delta Networks are derived by propagating coordinate-size estimates through the network, enabling stable learning-rate transfer across model widths with both AdamW and SGD optimizers.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Training and scaling Large Language Models demand enormous computational resources, motivating both efficient sub-quadratic architectures and principled hyperparameter tuning methods. While the Maximal Update Parametrization (μP) has enabled zero-shot hyperparameter transfer for standard Transformers, its extension to linear models, particularly those with structured state transitions and complicated architectures, remains largely unexplored. By rigorously propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics, we derive scaling rules for Gated Delta Networks. Experiments on language-model pre-training confirm that our configurations enable stable learning-rate transfer across model widths under both AdamW and SGD, whereas standard parametrization fails to transfer, validating the correctness and practical utility of our analysis.
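To make the notion of a "coordinate-size estimate" concrete, the following is a minimal sketch of the generic idea for a single dense layer, not the paper's derivation for Gated Delta Networks. It measures the typical (root-mean-square) entry of y = Wx as the width grows, comparing a fan-in-scaled initialization with a unit-variance one. The function names `coordinate_rms` and `compare`, the chosen widths, and the Gaussian inputs are all assumptions made for illustration.

```python
import math
import random


def coordinate_rms(width, init_std, rows=256, seed=0):
    """RMS entry of y = W x, with W entries i.i.d. N(0, init_std^2).

    Only `rows` output coordinates are sampled, which is enough to
    estimate the typical entry size without building a full matrix.
    """
    rng = random.Random(seed)
    # Input vector whose coordinates are of order 1.
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    total = 0.0
    for _ in range(rows):
        # One output coordinate: dot product of a fresh random row with x.
        y_i = sum(rng.gauss(0.0, init_std) * x_j for x_j in x)
        total += y_i * y_i
    return math.sqrt(total / rows)


def compare(widths=(64, 256, 1024)):
    """Coordinate size under fan-in-scaled vs. unit-variance initialization."""
    results = {}
    for n in widths:
        results[n] = {
            # std = 1/sqrt(n): output coordinates should stay of order 1.
            "fan_in": coordinate_rms(n, 1.0 / math.sqrt(n)),
            # std = 1: output coordinates should grow roughly like sqrt(n).
            "unit": coordinate_rms(n, 1.0),
        }
    return results


if __name__ == "__main__":
    for n, sizes in compare().items():
        print(n, sizes)
```

With fan-in scaling the measured coordinate size stays roughly constant as the width changes, while with unit variance it grows with the width. Tracking this kind of width dependence layer by layer is the bookkeeping that the abstract describes extending to gating and recurrent state updates.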