Abstract
This paper addresses the spatial blindness problem in Multimodal Large Language Models by leveraging implicit 3D priors learned in video generation models. The authors propose VEGA-3D, a framework that repurposes pre-trained video diffusion models as latent world simulators to extract robust 3D structural priors and physical understanding. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations through token-level adaptive gated fusion, the method enriches MLLMs with dense geometric cues for improved scene understanding and geometric reasoning capabilities.
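The token-level adaptive gated fusion described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual code: the function and parameter names (`gated_fusion`, `W_g`, `b_g`) are assumptions, and the gate here is a simple per-token sigmoid over the concatenated feature streams.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(semantic, geometric, W_g, b_g):
    """Illustrative token-level adaptive gated fusion.

    semantic:  (T, D) semantic tokens, e.g. from the MLLM's vision encoder
    geometric: (T, D) spatiotemporal features from the video diffusion model
    W_g, b_g:  hypothetical gate parameters; the gate is computed per token
               from the concatenation of both feature streams.
    """
    # Per-token, per-channel gate in [0, 1]: how much geometric signal to admit.
    gate = sigmoid(np.concatenate([semantic, geometric], axis=-1) @ W_g + b_g)
    # Convex combination of the two streams.
    return gate * geometric + (1.0 - gate) * semantic

# Toy usage with random features (T=4 tokens, D=8 channels).
rng = np.random.default_rng(0)
T, D = 4, 8
sem = rng.normal(size=(T, D))
geo = rng.normal(size=(T, D))
W_g = rng.normal(size=(2 * D, D)) * 0.1
b_g = np.zeros(D)
fused = gated_fusion(sem, geo, W_g, b_g)
```

Because the gate yields a convex combination, every fused value lies between the corresponding semantic and geometric feature values, so the fusion can fall back to pure semantics wherever the geometric cue is unhelpful.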