Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs
cs.CV, Large Language Models, Trending
Huan Zheng, Yucheng Zhou, Tianyi Yan, Dubing Chen, Hongbo Lu, Wenlong Liao, Tao He, Pai Peng, Jianbing Shen
March 21, 2026
arXiv: 2603.20698v1

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable potential in medical image analysis. However, their application in gastrointestinal endoscopy is currently hindered by two critical limitations: the misalignment between general model reasoning and standardized clinical cognitive pathways, and the lack of causal association between visual features and diagnostic outcomes. In this paper, we propose a novel Clinical-Cognitive-Aligned (CogAlign) framework to address these challenges. First, we endow the model with rigorous clinical analytical capabilities by constructing the hierarchical clinical cognition dataset and employing Supervised Fine-Tuning (SFT). Unlike conventional approaches, this strategy internalizes the hierarchical diagnostic logic of experts, ranging from anatomical localization and morphological evaluation to microvascular analysis, directly into the model. Second, to eliminate visual bias, we provide a theoretical analysis demonstrating that standard supervised tuning inevitably converges to spurious background correlations. Guided by this insight, we propose a counterfactual-driven reinforcement learning strategy to enforce causal rectification. By generating counterfactual normal samples via lesion masking and optimizing through clinical-cognition-centric rewards, we constrain the model to strictly ground its diagnosis in causal lesion features. Extensive experiments demonstrate that our approach achieves State-of-the-Art (SoTA) performance across multiple benchmarks, significantly enhancing diagnostic accuracy in complex clinical scenarios. All source code and datasets will be made publicly available.

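Multimodal large language models (MLLMs) show great potential in medical image analysis, but their use in gastrointestinal endoscopy faces two key limitations: model reasoning is not aligned with clinical cognitive pathways, and visual features lack a causal link to diagnostic outcomes. This paper proposes the Clinical-Cognitive-Aligned (CogAlign) framework. It builds a hierarchical clinical cognition dataset and uses supervised fine-tuning (SFT) to internalize experts' hierarchical diagnostic logic in the model, and it designs a counterfactual-driven reinforcement learning strategy to remove visual bias so that diagnoses are grounded strictly in causal lesion features. Experiments show that the method achieves state-of-the-art (SOTA) performance on multiple benchmarks and significantly improves diagnostic accuracy in complex clinical scenarios.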
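The abstract mentions generating counterfactual normal samples via lesion masking and scoring the model with clinical-cognition-centric rewards, but it does not say how either step is implemented. The following is only a minimal sketch of the general idea under stated assumptions: the function names `mask_lesion` and `counterfactual_reward`, the bounding-box format, the mean-color fill, and the two-term reward are all hypothetical and are not taken from the paper.

```python
import numpy as np


def mask_lesion(image, box):
    """Return a copy of `image` with the lesion box filled by the background mean.

    Hypothetical helper: `box` is (top, left, bottom, right), half-open, and is
    assumed to cover only part of the image. The paper's actual masking
    procedure is not described in the abstract.
    """
    top, left, bottom, right = box
    out = image.astype(np.float32).copy()
    # Boolean mask of pixels outside the lesion box.
    keep = np.ones(image.shape[:2], dtype=bool)
    keep[top:bottom, left:right] = False
    # Per-channel mean of the remaining (background) pixels.
    fill = out[keep].mean(axis=0)
    out[top:bottom, left:right] = fill
    return out


def counterfactual_reward(pred_original, pred_masked, true_label,
                          normal_label="normal"):
    """Toy reward (an assumption, not the paper's reward).

    +1 if the diagnosis on the original image is correct, and
    +1 if the lesion-masked image is judged normal, so a model that relies
    on background cues instead of the lesion scores lower.
    """
    reward = 0.0
    if pred_original == true_label:
        reward += 1.0
    if pred_masked == normal_label:
        reward += 1.0
    return reward
```

In this sketch, a model that still predicts a lesion after the lesion has been masked out earns a lower reward, which is one simple way to penalize reliance on background correlations.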


Categories

cs.CV, cs.AI
