Abstract
This paper addresses the limitation of Multimodal Large Language Models (MLLMs) in embodied agents, which struggle with spatial reasoning across extensive spatiotemporal scales. The authors introduce Video2Mental, a benchmark that evaluates mental navigation capabilities by requiring models to construct hierarchical cognitive maps from long egocentric videos and generate landmark-based path plans. The research draws inspiration from cognitive science, exploring how biological intelligence uses mental navigation and spatial simulation prior to action. Benchmarking results demonstrate that mental navigation capabilities do not naturally emerge from standard pre-training, with frontier MLLMs showing significant challenges in zero-shot scenarios. Planning accuracy is verified through simulator-based physical interaction, providing a comprehensive evaluation framework for embodied AI systems.