Abstract
This paper presents a systematic empirical study on scaling Reinforcement Learning for Large Language Model agents in complex, multi-turn environments. Using TravelPlanner as a testbed, the authors decompose the agentic RL design space along five critical axes: reward shaping, model scaling, data composition, algorithm selection, and environmental stability. Their controlled experiments reveal key insights: reward and algorithm choices are scale-dependent, with smaller models benefiting from staged rewards while larger models converge with simpler dense rewards; approximately 1K training samples with a balanced difficulty mixture represents the optimal training budget; and environmental stability is crucial for preventing policy degradation. The work provides a practical recipe for developing autonomous LLM agents capable of long-horizon tool orchestration and planning.
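The contrast between staged and dense rewards can be made concrete with a minimal sketch. This is an illustrative assumption, not the paper's actual reward implementation: the stage names, weighting, and function signatures are hypothetical, chosen only to show why partial per-stage credit gives smaller models a denser learning signal than a single end-of-episode score.

```python
def staged_reward(stages_completed: int, total_stages: int) -> float:
    """Staged reward: partial credit for each completed planning stage.

    Gives intermediate feedback during a multi-turn episode, which the
    abstract reports helps smaller models learn.
    """
    return stages_completed / total_stages


def dense_final_reward(constraints_met: int, total_constraints: int,
                       plan_complete: bool) -> float:
    """Dense final reward: one score from constraint satisfaction at the end.

    No credit until a full plan is produced; the abstract reports larger
    models can converge with this simpler signal.
    """
    if not plan_complete:
        return 0.0
    return constraints_met / total_constraints


# Example: an agent that finished 3 of 4 stages, then produced a plan
# satisfying 5 of 8 constraints.
print(staged_reward(3, 4))             # 0.75
print(dense_final_reward(5, 8, True))  # 0.625
```

The key design difference is where the gradient signal arrives: the staged scheme rewards progress mid-episode, while the dense scheme only scores the finished plan.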