Abstract
This paper addresses the challenge of improving data-utilization efficiency in reinforcement learning for large language models. The authors propose ReVal, a Bellman-update-based method that enables off-policy learning by combining stepwise consistency signals with trajectory-level outcome verification. Because it supports replay-buffer-based training, ReVal can reuse past trajectories efficiently, yielding markedly better sample efficiency than on-policy approaches. Experiments on mathematical reasoning benchmarks, in particular with DeepSeek-R1-Distill-1.5B, show that ReVal converges faster and reaches stronger final performance, including a 2.7% improvement on AIME24.
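The abstract's core mechanism, replay-buffer-based training driven by Bellman updates, can be sketched minimally as follows. This is an illustrative assumption, not the paper's actual implementation: the names `ReplayBuffer` and `td_update`, the transition format, and the tabular value function are all hypothetical, chosen only to show how reusing stored trajectories with a Bellman (temporal-difference) target enables off-policy learning.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions so they can be reused across updates
    (the off-policy data reuse the abstract attributes to ReVal)."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Sample uniformly; a real system might prioritize transitions.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))


def td_update(values, transition, gamma=0.99, lr=0.1):
    """One Bellman update on a tabular value function:
    V(s) <- V(s) + lr * (r + gamma * V(s') - V(s))."""
    state, reward, next_state, done = transition
    target = reward + (0.0 if done else gamma * values.get(next_state, 0.0))
    values[state] = values.get(state, 0.0) + lr * (target - values.get(state, 0.0))
    return values


# Usage: replay stored transitions instead of requiring fresh on-policy rollouts.
buffer = ReplayBuffer()
buffer.add(("s0", 0.0, "s1", False))
buffer.add(("s1", 1.0, "s2", True))   # terminal transition with outcome reward
values = {}
for _ in range(50):                    # each pass reuses the same stored data
    for t in buffer.sample(2):
        td_update(values, t)
```

The key property this illustrates is that updates are driven by stored transitions rather than fresh rollouts, which is what makes sample reuse, and hence the sample-efficiency gains the abstract claims, possible.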