Abstract
Large language models often exhibit selection bias in multiple-choice and pairwise evaluation tasks due to non-semantic factors such as option positions and label symbols. Existing inference-time debiasing methods are computationally expensive and may harm reasoning capabilities, while pointwise training approaches ignore the importance of giving consistent answers across different permutations of the same question. This paper proposes Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning through two complementary mechanisms: a cross-permutation advantage, computed relative to the mean reward over all permutations of a question, and a consistency-aware reward that encourages stable decisions across different permutations. The proposed approach effectively mitigates selection bias while preserving the model's reasoning capabilities.
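To make the two mechanisms concrete, below is a minimal Python sketch, not the paper's implementation. It assumes rewards and canonically-mapped answers have been collected for G sampled completions under each of P option permutations of the same question; the function name `pa_grpo_advantages`, the majority-vote form of the consistency bonus, and the `lam` weight are assumptions introduced here for illustration only.

```python
import numpy as np

def pa_grpo_advantages(rewards, answers, lam=0.5):
    """Illustrative permutation-aware advantage computation.

    rewards: (P, G) array -- reward of each of G sampled completions
             under each of P option permutations of the same question.
    answers: (P, G) array -- each completion's chosen answer, mapped back
             to a canonical labeling so agreement is comparable across
             permutations.
    lam:     assumed weight of the consistency bonus.
    """
    rewards = np.asarray(rewards, dtype=float)
    answers = np.asarray(answers)

    # Consistency-aware reward (one plausible instantiation): bonus for
    # completions whose answer agrees with the majority answer taken over
    # all permutations and samples of this question.
    flat = answers.reshape(-1)
    values, counts = np.unique(flat, return_counts=True)
    majority = values[np.argmax(counts)]
    shaped = rewards + lam * (answers == majority).astype(float)

    # Cross-permutation advantage: the baseline is the mean shaped reward
    # over ALL permutations, so a completion is credited only relative to
    # the permutation-averaged baseline, not its own permutation's group.
    baseline = shaped.mean()
    std = shaped.std() + 1e-8
    return (shaped - baseline) / std

# Toy example: 2 permutations x 3 sampled completions per permutation.
rewards = [[1.0, 0.0, 1.0],
           [0.0, 0.0, 1.0]]
answers = [["B", "A", "B"],
           ["A", "C", "B"]]
print(pa_grpo_advantages(rewards, answers))
```

In this sketch, averaging the baseline across permutations removes credit that a completion earns merely from a favorable option ordering, while the consistency bonus pushes the policy toward the same decision regardless of how options are shuffled; the exact reward shaping used by PA-GRPO may differ.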