Paper Detail
Mind over Space: Can Multimodal Large Language Models Mentally Navigate?
Author: Unknown
March 23, 2026
arXiv: 2603.21577v1

Number of authors: 1
Number of tags: 5
Content status: PDF available


Abstract

This paper addresses the limitation of Multimodal Large Language Models (MLLMs) in embodied agents, which struggle with spatial reasoning across extensive spatiotemporal scales. The authors introduce Video2Mental, a benchmark that evaluates mental navigation capabilities by requiring models to construct hierarchical cognitive maps from long egocentric videos and generate landmark-based path plans. The research draws inspiration from cognitive science, exploring how biological intelligence uses mental navigation and spatial simulation prior to action. Benchmarking results demonstrate that mental navigation capabilities do not naturally emerge from standard pre-training, with frontier MLLMs showing significant challenges in zero-shot scenarios. Planning accuracy is verified through simulator-based physical interaction, providing a comprehensive evaluation framework for embodied AI systems.
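The abstract describes the task format only at a high level. Purely as an illustration of what a landmark-based path plan and an ordered-landmark check might look like, a minimal Python sketch follows; the names Landmark, PathPlan, and plan_matches_reference are hypothetical and are not part of the Video2Mental benchmark, whose actual data format and simulator-based verification are not specified here.

```python
# Hypothetical sketch only: the names below (Landmark, PathPlan,
# plan_matches_reference) are illustrative assumptions, not the
# Video2Mental benchmark's actual API or metric.
from dataclasses import dataclass, field

@dataclass
class Landmark:
    name: str            # e.g. "kitchen doorway"
    parent_region: str   # coarser node in a hierarchical cognitive map, e.g. "kitchen"

@dataclass
class PathPlan:
    start: str
    goal: str
    landmarks: list[Landmark] = field(default_factory=list)  # ordered waypoints

def plan_matches_reference(predicted: PathPlan, reference: PathPlan) -> bool:
    """Toy check: accept a predicted plan if it visits the reference landmarks
    in order (the benchmark instead verifies plans via simulator-based
    physical interaction)."""
    ref_names = [lm.name for lm in reference.landmarks]
    pred_names = iter(lm.name for lm in predicted.landmarks)
    # Ordered-subsequence test: each reference landmark must appear, in order,
    # somewhere in the predicted landmark sequence.
    return all(any(name == target for name in pred_names) for target in ref_names)

# Example usage with made-up landmarks.
reference = PathPlan("bedroom", "garage",
                     [Landmark("hallway mirror", "hallway"),
                      Landmark("front door", "entrance")])
predicted = PathPlan("bedroom", "garage",
                     [Landmark("hallway mirror", "hallway"),
                      Landmark("coat rack", "entrance"),
                      Landmark("front door", "entrance")])
print(plan_matches_reference(predicted, reference))  # True
```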


PDF Preview

View or download the PDF on arXiv

Categories

cs.CV, cs.AI

Deep Analysis

The AI reads the paper in depth and generates an insightful summary of its content