Microsoft

Enterprise

company

Verified

https://www.microsoft.com/en-us/research/

microsoft

Recent Activity

fepegar new activity about 7 hours ago

microsoft/colipri:Reproducing CT-RATE retrieval numbers

jmz-msft updated a model 3 days ago

microsoft/llava-rad

qianhuiwu submitted a paper 4 days ago

Orchard: An Open-Source Agentic Modeling Framework

Papers

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

Synthetic Computers at Scale for Long-Horizon Productivity Simulation

Articles

Differential Transformer V2

Jan 20

Introducing OptiMind, a research model designed for optimization

Jan 15

alvarobartt

posted an update about 6 hours ago

Latest hf-mem release added a breakdown of Mixture-of-Experts (MoE) memory usage!

TL;DR: Reasoning about MoEs from active parameters alone can be misleading. Each token activates only a subset of experts, but the serving setup still has to account for the full resident memory footprint.

🧠 hf-mem now splits MoE memory into base model weights, routed experts, and KV cache
🏗️ Dense models usually load and use most weights every forward pass, while MoEs load many experts but only route each token to a few of them
⚡ Active params aren't the same as memory footprint, especially for sparse architectures
📦 Runtime memory covers what is used per request/token, while load-time memory also includes the expert weights that need to stay resident
📚 KV cache can still dominate depending on context length, batch size, and concurrency
🔀 Expert Parallelism (EP) helps shard experts across accelerators when expert weights dominate
🚀 Data Parallelism (DP) + EP is often a good fit for throughput-oriented MoE serving
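
To make the points above concrete, here is a minimal back-of-the-envelope sketch in Python. It is not hf-mem's implementation or API: the `MoEConfig` class, the helper functions, and every number in the example are hypothetical, chosen only to keep the arithmetic easy to follow. It assumes 2-byte (bf16/fp16) weights and KV cache, standard multi-head or grouped-query attention, and, for Expert Parallelism, that base weights are replicated on every device while experts are split evenly.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    """Hypothetical MoE shape; all values are illustrative, not a real model."""
    base_params: int          # non-expert weights: attention, embeddings, router, norms
    params_per_expert: int    # weights in one routed expert of one MoE layer
    num_experts: int          # routed experts per MoE layer
    experts_per_token: int    # experts the router picks for each token
    num_moe_layers: int
    num_layers: int           # layers that hold a KV cache
    num_kv_heads: int
    head_dim: int
    bytes_per_param: int = 2  # assumption: bf16/fp16 weights
    bytes_per_kv: int = 2     # assumption: bf16/fp16 KV cache


def base_bytes(cfg: MoEConfig) -> int:
    return cfg.base_params * cfg.bytes_per_param


def expert_bytes(cfg: MoEConfig, experts_per_layer: int) -> int:
    return (cfg.params_per_expert * experts_per_layer
            * cfg.num_moe_layers * cfg.bytes_per_param)


def resident_weight_bytes(cfg: MoEConfig) -> int:
    # Loading: every routed expert must be resident, even if rarely selected.
    return base_bytes(cfg) + expert_bytes(cfg, cfg.num_experts)


def active_weight_bytes(cfg: MoEConfig) -> int:
    # Per token: only the experts chosen by the router are actually used.
    return base_bytes(cfg) + expert_bytes(cfg, cfg.experts_per_token)


def kv_cache_bytes(cfg: MoEConfig, context_len: int, batch_size: int) -> int:
    # The factor 2 accounts for storing both keys and values.
    return (2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim
            * context_len * batch_size * cfg.bytes_per_kv)


def per_device_weight_bytes(cfg: MoEConfig, ep_degree: int) -> int:
    # Assumption: base weights replicated, experts sharded evenly across devices.
    if cfg.num_experts % ep_degree != 0:
        raise ValueError("num_experts must be divisible by ep_degree")
    return base_bytes(cfg) + expert_bytes(cfg, cfg.num_experts // ep_degree)


# Hypothetical model: 2B base params, 8 experts of 100M params, top-2 routing.
cfg = MoEConfig(
    base_params=2_000_000_000, params_per_expert=100_000_000,
    num_experts=8, experts_per_token=2, num_moe_layers=32,
    num_layers=32, num_kv_heads=8, head_dim=128,
)

for label, value in [
    ("resident weights", resident_weight_bytes(cfg)),
    ("active weights per token", active_weight_bytes(cfg)),
    ("KV cache (4k context, batch 1)", kv_cache_bytes(cfg, 4096, 1)),
    ("KV cache (32k context, batch 64)", kv_cache_bytes(cfg, 32768, 64)),
    ("weights per device with EP=8", per_device_weight_bytes(cfg, 8)),
]:
    print(f"{label}: {value / 1e9:.1f} GB")
```

With these made-up numbers, the resident weights are several times larger than the weights active per token, and the KV cache at long context and high batch size outgrows the weights entirely, which is why active parameter counts alone say little about the memory a deployment needs.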

Check the repository at https://github.com/alvarobartt/hf-mem