Holistic Scaling Laws for Optimal Mixture-of-Experts Architecture Optimization

cs.CL | Large Language Models | Transformer | Trending | End-to-End

Anonymous

March 23, 2026

arXiv: 2603.21862v1

Authors: 1

Tags: 4

Content status: PDF included

Abstract

Scaling laws for Large Language Models govern macroscopic resource allocation, yet translating them into precise Mixture-of-Experts (MoE) architectural configurations remains an open problem due to the combinatorially vast design space. This paper proposes a reusable framework for holistic MoE architectural optimization that addresses limitations in existing approaches. The authors demonstrate that FLOPs per token alone is insufficient for fair comparison of MoE models because varying computational densities across layer types can inflate parameters without proportional compute cost. They establish a joint constraint triad of FLOPs per token, active parameters, and total parameters, then reduce the 16-dimensional architectural search space to two sequential low-dimensional phases through algebraic constraints.
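
The claim that FLOPs per token alone cannot fairly compare MoE models can be illustrated with a rough back-of-the-envelope sketch. This is not the paper's accounting: the `MoEConfig` fields, the helper functions, and the two example configurations are hypothetical, and the formulas are a common simplification that counts forward-pass FLOPs only and ignores embeddings, normalization, gated FFN variants, shared experts, and causal masking. Attention spends FLOPs on score and value products that carry no parameters, so moving budget from attention into expert FFNs raises the parameter count at the same FLOPs per token.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    """Hypothetical, simplified MoE description (not the paper's 16 dimensions)."""
    n_layers: int
    d_model: int    # residual stream width
    d_attn: int     # total attention width (heads * head_dim)
    d_expert: int   # hidden width of one expert FFN
    n_experts: int  # routed experts per layer
    top_k: int      # experts activated per token
    seq_len: int    # context length used for attention-score FLOPs


def _layer_params(c: MoEConfig, experts_counted: int) -> int:
    attention = 4 * c.d_model * c.d_attn                    # Q, K, V, O projections
    experts = experts_counted * 2 * c.d_model * c.d_expert  # up + down projections
    router = c.d_model * c.n_experts                        # linear routing layer
    return attention + experts + router


def total_params(c: MoEConfig) -> int:
    """All weights stored in the model, counting every expert."""
    return c.n_layers * _layer_params(c, c.n_experts)


def active_params(c: MoEConfig) -> int:
    """Weights touched by one token, counting only the top_k experts."""
    return c.n_layers * _layer_params(c, c.top_k)


def flops_per_token(c: MoEConfig) -> int:
    """Approximate forward FLOPs per token under the assumptions above."""
    weight_flops = 2 * active_params(c)                  # one multiply-add per active weight
    score_flops = c.n_layers * 4 * c.seq_len * c.d_attn  # QK^T and AV, parameter-free
    return weight_flops + score_flops


# Two illustrative configurations: the second halves the attention width and
# spends the freed compute on wider experts.
dense_attention = MoEConfig(n_layers=16, d_model=1024, d_attn=1024,
                            d_expert=1024, n_experts=8, top_k=2, seq_len=4096)
wide_experts = MoEConfig(n_layers=16, d_model=1024, d_attn=512,
                         d_expert=2560, n_experts=8, top_k=2, seq_len=4096)

if __name__ == "__main__":
    for name, cfg in [("dense_attention", dense_attention),
                      ("wide_experts", wide_experts)]:
        print(name, flops_per_token(cfg), active_params(cfg), total_params(cfg))
```

Under these assumptions the two configurations have identical FLOPs per token, yet `wide_experts` has roughly 50% more active parameters and more than twice the total parameters. Matching models on all three quantities at once, as the abstract's constraint triad does, rules out this kind of mismatch.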
