Papers
arxiv:2606.17526

MGUP: A Momentum-Gradient Alignment Update Policy for Stochastic Optimization

Published on Jun 16
Authors:
,

Abstract

A novel selective update mechanism called MGUP enhances momentum-based optimizers with controlled parameter step sizes, providing theoretical convergence guarantees and improved training efficiency for large language models.

Efficient optimization is essential for training large language models. Although intra-layer selective updates have been explored, a general mechanism that enables fine-grained control while ensuring convergence guarantees is still lacking. To bridge this gap, we propose MGUP, a novel mechanism for selective updates. MGUP augments standard momentum-based optimizers by applying larger step-sizes to a selected fixed proportion of parameters in each iteration, while applying smaller, non-zero step-sizes to the rest. As a nearly {plug-and-play} module, MGUP seamlessly integrates with optimizers such as AdamW, Lion, and Muon. This yields powerful variants such as MGUP-AdamW, MGUP-Lion, and MGUP-Muon. Under standard assumptions, we provide theoretical convergence guarantees for MGUP-AdamW (without weight decay) in stochastic optimization. Extensive experiments across diverse tasks, including MAE pretraining, LLM pretraining, and downstream fine-tuning, demonstrate that our MGUP-enhanced optimizers achieve superior or more stable performance compared to their original base optimizers. We offer a principled, versatile, and theoretically grounded strategy for efficient intra-layer selective updates, accelerating and stabilizing the training of large-scale models. The code is publicly available at https://github.com/MaeChd/MGUP.

Community

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2606.17526
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2606.17526 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2606.17526 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2606.17526 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.