Abstract
This paper addresses the limitation of Multimodal Large Language Models (MLLMs) in embodied agents, which struggle with spatial reasoning across extensive spatiotemporal scales. The authors introduce Video2Mental, a benchmark that evaluates mental navigation capabilities by requiring models to construct hierarchical cognitive maps from long egocentric videos and generate landmark-based path plans. The research draws inspiration from cognitive science, exploring how biological intelligence uses mental navigation and spatial simulation prior to action. Benchmarking results demonstrate that mental navigation capabilities do not naturally emerge from standard pre-training, with frontier MLLMs showing significant challenges in zero-shot scenarios. Planning accuracy is verified through simulator-based physical interaction, providing a comprehensive evaluation framework for embodied AI systems.