Paper Detail

Off-Policy Value-Based Reinforcement Learning for Large Language Models面向大语言模型的离策略价值函数强化学习方法

cs.CL大语言模型端到端Transformer热门获取

ReVal Team

2026年03月24日

arXiv: 2603.23355v1

作者人数

1

标签数量

4

内容状态

含 PDF

原文 + 中文

同页查看标题和摘要的双语信息

PDF 预览

直接在详情页阅读或下载论文全文

深度分析

继续下钻到 AI 生成的结构化解读

摘要 / Abstract

This paper addresses the critical challenge of improving data utilization efficiency in reinforcement learning for large language models. The authors propose ReVal, a novel Bellman-update-based method that enables off-policy learning through combining stepwise consistency signals with trajectory-level outcome verification. By supporting replay-buffer-based training, ReVal allows efficient reuse of past trajectories, significantly improving sample efficiency compared to on-policy approaches. Experimental results on mathematical reasoning benchmarks, particularly with DeepSeek-R1-Distill-1.5B, demonstrate that ReVal achieves faster convergence and superior final performance, with improvements of 2.7% on AIME24.

摘要 / Abstract

分类 / Categories

深度分析