Abstract
Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction for more efficient LLM fine-tuning. However, the computational overhead of computing row-wise norms of the adapted weight matrices creates significant memory challenges, especially at high ranks and across hundreds of adapted modules. This work introduces a factored norm decomposition that eliminates dense matrix materialization by computing squared norms through base, cross, and Gram terms with O(d_out r + r^2) complexity. Additionally, fused Triton kernels combine the four-kernel DoRA composition into a single pass, achieving approximately 4x memory traffic reduction and numerical stability in near-unity rescaling regimes. These optimizations make high-rank DoRA feasible on common single-GPU setups for large language model adaptation.
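The factored norm decomposition described above can be sketched as follows. For an adapted weight W = W0 + BA (with B of shape d_out × r and A of shape r × d_in), the squared norm of row i expands into a base term ‖w0_i‖², a cross term 2·b_i(A w0_iᵀ), and a Gram term b_i(AAᵀ)b_iᵀ — none of which require materializing the dense d_out × d_in matrix BA. This is a minimal PyTorch sketch of that algebra, not the paper's implementation; the function name and the `eps` clamp are assumptions for illustration.

```python
import torch

def factored_row_norms(W0, A, B, eps=1e-12):
    """Row-wise norms of W0 + B @ A without materializing B @ A.

    W0: (d_out, d_in) frozen base weight
    A:  (r, d_in)     low-rank down-projection
    B:  (d_out, r)    low-rank up-projection
    Returns: (d_out,) row norms of the adapted weight.
    """
    # Base term: squared row norms of the frozen weight, shape (d_out,).
    base = (W0 * W0).sum(dim=1)
    # Cross term: 2 * b_i . (A w0_i), computed via the (d_out, r)
    # product W0 @ A.T -- only d_out * r intermediate values.
    cross = 2.0 * ((W0 @ A.T) * B).sum(dim=1)
    # Gram term: b_i (A A^T) b_i^T, using the small (r, r) Gram matrix.
    G = A @ A.T
    gram = ((B @ G) * B).sum(dim=1)
    # Clamp before sqrt for numerical safety (assumed detail, not from the paper).
    return torch.sqrt((base + cross + gram).clamp_min(eps))
```

The intermediates here occupy O(d_out·r + r²) memory, matching the complexity stated in the abstract; a quick check against `torch.linalg.norm(W0 + B @ A, dim=1)` on random matrices confirms the algebra.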