The Dual Mechanisms of Spatial Reasoning in Vision-Language Models

cs.CV · Large Language Models · CV · Transformer · Trending · Multimodal

Anonymous Authors

March 24, 2026

arXiv: 2603.22278v1

Abstract

This paper investigates how vision-language models associate objects with their properties and spatial relations in multimodal tasks like image captioning and visual question answering. The research reveals that VLMs employ two concurrent mechanisms for spatial reasoning: the language model backbone represents content-independent spatial relations on visual tokens, while the vision encoder encodes object layouts that are directly utilized by the language model. The dominant spatial information originates from the vision encoder and is distributed globally across visual tokens, extending beyond object regions. These findings provide insights into the internal workings of multimodal models for spatial understanding.
