arxiv:2602.06563

TokenMixer-Large: Scaling Up Large Ranking Models in Industrial Recommenders

Published on Feb 6

Authors:

Abstract

TokenMixer-Large addresses scalability limitations in recommendation models through enhanced residual connections, sparse MoE mechanisms, and stable gradient propagation for deep architectures.

AI-generated summary

While scaling laws for recommendation models have gained significant traction, existing architectures such as Wukong, HiFormer and DHEN, often struggle with sub-optimal designs and hardware under-utilization, limiting their practical scalability. Our previous TokenMixer architecture (introduced in RankMixer paper) addressed effectiveness and efficiency by replacing self-attention with a ightweight token-mixing operator; however, it faced critical bottlenecks in deeper configurations, including sub-optimal residual paths, vanishing gradients, incomplete MoE sparsification and constrained scalability. In this paper, we propose TokenMixer-Large, a systematically evolved architecture designed for extreme-scale recommendation. By introducing a mixing-and-reverting operation, inter-layer residuals and the auxiliary loss, we ensure stable gradient propagation even as model depth increases. Furthermore, we incorporate a Sparse Per-token MoE to enable efficient parameter expansion. TokenMixer-Large successfully scales its parameters to 7-billion and 15-billion on online traffic and offline experiments, respectively. Currently deployed in multiple scenarios at ByteDance, TokenMixer-Large has achieved significant offline and online performance gains, delivering an increase of +1.66\% in orders and +2.98\% in per-capita preview payment GMV for e-commerce, improving ADSS by +2.0\% in advertising and achieving a +1.4\% revenue growth for live streaming.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2602.06563 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2602.06563 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2602.06563 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.