Abstract
Spatial reasoning is foundational for Vision-Language Models deployed as Vision-Language-Action agents in physical environments. This paper introduces MultihopSpatial, a benchmark designed for multi-hop and compositional spatial reasoning with complex queries across diverse spatial perspectives. The work proposes Acc@50IoU, a metric that simultaneously evaluates reasoning and visual grounding by requiring both answer selection and precise bounding box prediction. MultihopSpatial-Train provides a large-scale training corpus to foster spatial intelligence. Extensive evaluation of 37 state-of-the-art VLMs yields key insights into the capabilities and limitations of current models for robust VLA deployment in embodied AI scenarios.
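The abstract describes Acc@50IoU as requiring both a correct answer selection and a precise bounding box. A minimal sketch of how such a joint metric could be computed is shown here. It assumes that "50IoU" means an intersection-over-union threshold of 0.5, that the threshold is inclusive, and that boxes are given as `(x_min, y_min, x_max, y_max)`. The function names `iou` and `acc_at_iou` are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so non-overlapping boxes contribute no intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(
    pred_answers: Sequence[str],
    gold_answers: Sequence[str],
    pred_boxes: Sequence[Box],
    gold_boxes: Sequence[Box],
    threshold: float = 0.5,  # assumption: "50IoU" denotes IoU >= 0.5
) -> float:
    """Fraction of samples where the answer matches AND the box IoU meets the threshold."""
    n = len(gold_answers)
    if not (len(pred_answers) == len(pred_boxes) == len(gold_boxes) == n):
        raise ValueError("all inputs must have the same length")
    if n == 0:
        return 0.0
    hits = sum(
        1
        for pa, ga, pb, gb in zip(pred_answers, gold_answers, pred_boxes, gold_boxes)
        if pa == ga and iou(pb, gb) >= threshold
    )
    return hits / n
```

Under this reading, a sample with the right answer but a poorly localized box scores zero, as does a sample with a well-localized box but the wrong answer, which is what makes the metric stricter than plain answer accuracy.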