When Negation Is a Geometry Problem in Vision-Language Models
Fawaz Sammani, Tzoulio Chamiti, Paul Gavrikov, Nikos Deligiannis
March 21, 2026
arXiv: 2603.20554v1


Abstract

Joint Vision-Language Embedding models such as CLIP typically fail at understanding negation in text queries - for example, failing to distinguish "no" in the query: "a plain blue shirt with no logos". Prior work has largely addressed this limitation through data-centric approaches, fine-tuning CLIP on large-scale synthetic negation datasets. However, these efforts are commonly evaluated using retrieval-based metrics that cannot reliably reflect whether negation is actually understood. In this paper, we identify two key limitations of such evaluation metrics and investigate an alternative evaluation framework based on Multimodal LLMs-as-a-judge, which typically excel at understanding simple yes/no questions about image content, providing a fair evaluation of negation understanding in CLIP models. We then ask whether there already exists a direction in the CLIP embedding space associated with negation. We find evidence that such a direction exists, and show that it can be manipulated through test-time intervention via representation engineering to steer CLIP toward negation-aware behavior without any fine-tuning. Finally, we test negation understanding on non-common image-text samples to evaluate generalization under distribution shifts.

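The test-time intervention the abstract describes can be illustrated with a minimal sketch. The sketch below uses toy NumPy vectors in place of real CLIP text embeddings, and a simple difference-of-means estimate of the negation direction; the pairing of captions with and without negation, the steering coefficient `alpha`, and the `steer` helper are all illustrative assumptions, not the paper's actual method.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension (a real CLIP text encoder uses e.g. 512 or 768)

# Hypothetical paired text embeddings: the same caption with and without a
# negation ("a shirt with logos" vs. "a shirt with no logos"). In practice
# these would come from a frozen CLIP text encoder.
pos = rng.normal(size=(16, d))
neg = pos + 0.5 + 0.1 * rng.normal(size=(16, d))

# Difference of means gives a candidate "negation direction" in embedding space.
direction = (neg - pos).mean(axis=0)
direction /= np.linalg.norm(direction)

def steer(embedding, direction, alpha=1.0):
    """Test-time intervention: push an embedding along the negation direction,
    then renormalize, since CLIP similarity scores use unit-norm embeddings."""
    steered = embedding + alpha * direction
    return steered / np.linalg.norm(steered)

# Steering a query embedding increases its alignment with the direction.
query = rng.normal(size=d)
query /= np.linalg.norm(query)
steered = steer(query, direction, alpha=1.0)
print(np.dot(query, direction), "->", np.dot(steered, direction))
```

In a real setting, `query` would be the embedding of a negated text query, and the steered embedding would then be compared against image embeddings by cosine similarity as usual, without any fine-tuning of the model.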


Categories

cs.CV, cs.AI
