Abstract
Scaling laws for Large Language Models govern macroscopic resource allocation, yet translating them into precise Mixture-of-Experts (MoE) architectural configurations remains an open problem due to the combinatorially vast design space. This paper proposes a reusable framework for holistic MoE architectural optimization that addresses limitations in existing approaches. The authors demonstrate that FLOPs per token alone is insufficient for fair comparison of MoE models because varying computational densities across layer types can inflate parameters without proportional compute cost. They establish a joint constraint triad of FLOPs per token, active parameters, and total parameters, then reduce the 16-dimensional architectural search space to two sequential low-dimensional phases through algebraic constraints.
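To make the constraint triad concrete, here is a minimal back-of-the-envelope sketch of how FLOPs per token, active parameters, and total parameters can diverge for an MoE model. This is not the paper's accounting: the `MoEConfig` fields, the FFN-only parameter count (attention, embeddings, and router weights are ignored), and the ~2-FLOPs-per-active-parameter matmul rule are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class MoEConfig:
    """Hypothetical MoE shape (field names are illustrative, not the paper's)."""
    d_model: int    # hidden size
    n_layers: int   # number of MoE transformer layers
    n_experts: int  # experts per MoE layer
    top_k: int      # experts routed per token
    d_ff: int       # per-expert FFN inner dimension

def ffn_params_per_expert(cfg: MoEConfig) -> int:
    # Up-projection (d_model x d_ff) plus down-projection (d_ff x d_model).
    return 2 * cfg.d_model * cfg.d_ff

def total_params(cfg: MoEConfig) -> int:
    # Every expert in every layer counts toward the memory footprint.
    return cfg.n_layers * cfg.n_experts * ffn_params_per_expert(cfg)

def active_params(cfg: MoEConfig) -> int:
    # Only the top-k routed experts touch a given token.
    return cfg.n_layers * cfg.top_k * ffn_params_per_expert(cfg)

def flops_per_token(cfg: MoEConfig) -> int:
    # Dense-matmul rule of thumb: ~2 FLOPs per active parameter per token.
    return 2 * active_params(cfg)

if __name__ == "__main__":
    # Same top_k and d_ff, so identical FLOPs/token and active parameters,
    # yet B carries 8x the total parameters: FLOPs/token alone cannot
    # distinguish these two designs, which is why a joint triad is needed.
    a = MoEConfig(d_model=1024, n_layers=24, n_experts=8,  top_k=2, d_ff=4096)
    b = MoEConfig(d_model=1024, n_layers=24, n_experts=64, top_k=2, d_ff=4096)
    for name, cfg in (("A", a), ("B", b)):
        print(name, flops_per_token(cfg), active_params(cfg), total_params(cfg))
```

Under these assumptions, pinning all three quantities at once is what collapses the search space: with FLOPs/token, active parameters, and total parameters fixed, most shape variables become algebraically dependent on the remaining few, which is the intuition behind the paper's reduction of the 16-dimensional search to two sequential low-dimensional phases.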