Abstract
This paper addresses the spatial blindness problem in Multimodal Large Language Models by leveraging implicit 3D priors learned in video generation models. The authors propose VEGA-3D, a framework that repurposes pre-trained video diffusion models as latent world simulators to extract robust 3D structural priors and physical understanding. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations through token-level adaptive gated fusion, the method enriches MLLMs with dense geometric cues for improved scene understanding and geometric reasoning capabilities.
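The token-level adaptive gated fusion described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual code: the function and parameter names (`gated_fusion`, `W_g`, `b_g`) are assumptions, and the gate here is a simple per-token sigmoid over the concatenated feature streams.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(semantic, geometric, W_g, b_g):
    """Illustrative token-level adaptive gated fusion.

    semantic:  (T, D) semantic tokens, e.g. from the MLLM's vision encoder
    geometric: (T, D) spatiotemporal features from the video diffusion model
    W_g, b_g:  hypothetical gate parameters; the gate is computed per token
               from the concatenation of both feature streams.
    """
    # Per-token, per-channel gate in [0, 1]: how much geometric signal to admit.
    gate = sigmoid(np.concatenate([semantic, geometric], axis=-1) @ W_g + b_g)
    # Convex combination of the two streams.
    return gate * geometric + (1.0 - gate) * semantic

# Toy usage with random features (T=4 tokens, D=8 channels).
rng = np.random.default_rng(0)
T, D = 4, 8
sem = rng.normal(size=(T, D))
geo = rng.normal(size=(T, D))
W_g = rng.normal(size=(2 * D, D)) * 0.1
b_g = np.zeros(D)
fused = gated_fusion(sem, geo, W_g, b_g)
```

Because the gate yields a convex combination, every fused value lies between the corresponding semantic and geometric feature values, so the fusion can fall back to pure semantics wherever the geometric cue is unhelpful.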