Abstract
Large language models often exhibit selection bias in multiple-choice and pairwise evaluation tasks due to non-semantic factors such as option positions and label symbols. Existing inference-time debiasing methods are computationally expensive and may harm reasoning capabilities, while pointwise training approaches ignore the importance of giving consistent answers across different permutations of the same question. This paper proposes Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning through two complementary mechanisms: a cross-permutation advantage, computed relative to the mean reward over all permutations of a question, and a consistency-aware reward that encourages stable decisions across different permutations. The proposed approach effectively mitigates selection bias while preserving the model's reasoning capabilities.
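To make the two mechanisms concrete, below is a minimal Python sketch, not the paper's implementation. It assumes rewards and canonically-mapped answers have been collected for G sampled completions under each of P option permutations of the same question; the function name `pa_grpo_advantages`, the majority-vote form of the consistency bonus, and the `lam` weight are assumptions introduced here for illustration only.

```python
import numpy as np

def pa_grpo_advantages(rewards, answers, lam=0.5):
    """Illustrative permutation-aware advantage computation.

    rewards: (P, G) array -- reward of each of G sampled completions
             under each of P option permutations of the same question.
    answers: (P, G) array -- each completion's chosen answer, mapped back
             to a canonical labeling so agreement is comparable across
             permutations.
    lam:     assumed weight of the consistency bonus.
    """
    rewards = np.asarray(rewards, dtype=float)
    answers = np.asarray(answers)

    # Consistency-aware reward (one plausible instantiation): bonus for
    # completions whose answer agrees with the majority answer taken over
    # all permutations and samples of this question.
    flat = answers.reshape(-1)
    values, counts = np.unique(flat, return_counts=True)
    majority = values[np.argmax(counts)]
    shaped = rewards + lam * (answers == majority).astype(float)

    # Cross-permutation advantage: the baseline is the mean shaped reward
    # over ALL permutations, so a completion is credited only relative to
    # the permutation-averaged baseline, not its own permutation's group.
    baseline = shaped.mean()
    std = shaped.std() + 1e-8
    return (shaped - baseline) / std

# Toy example: 2 permutations x 3 sampled completions per permutation.
rewards = [[1.0, 0.0, 1.0],
           [0.0, 0.0, 1.0]]
answers = [["B", "A", "B"],
           ["A", "C", "B"]]
print(pa_grpo_advantages(rewards, answers))
```

In this sketch, averaging the baseline across permutations removes credit that a completion earns merely from a favorable option ordering, while the consistency bonus pushes the policy toward the same decision regardless of how options are shuffled; the exact reward shaping used by PA-GRPO may differ.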