Author count
Tag count
Content status
Original + Chinese
View the title and abstract in both languages on the same page
PDF Preview
Read or download the full paper directly from the detail page
Deep Analysis
Drill down further into the AI-generated structured interpretation
Abstract
This paper presents RoboAlign, a systematic framework for improving vision-language-action models (VLAs) by enhancing embodied reasoning capabilities in multimodal large language models (MLLMs). The key innovation involves sampling action tokens through zero-shot natural language reasoning and refining them using reinforcement learning to improve action accuracy. The approach effectively bridges the modality gap between language understanding and low-level robot actions, facilitating knowledge transfer from multimodal LLMs to embodied agents. Experimental validation demonstrates that training VLAs with this framework leads to reliable performance improvements in robotic manipulation tasks.
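The abstract outlines a two-step recipe: sample action tokens via zero-shot natural-language reasoning from an MLLM, then refine them with reinforcement learning before training the VLA. The toy sketch below only illustrates the shape of that loop; every name in it (MockMLLM, reason_and_act, rl_refine, the reward stub) is an illustrative assumption, not the paper's actual interface or algorithm.

```python
# Minimal sketch of a sample-then-refine loop in the spirit of the abstract.
# All classes, functions, and the reward are hypothetical placeholders.
import random


class MockMLLM:
    """Stand-in for a multimodal LLM that emits a natural-language rationale
    followed by a short sequence of discretized action tokens."""

    def __init__(self, vocab_size: int = 256):
        self.vocab_size = vocab_size

    def reason_and_act(self, observation: str, instruction: str) -> tuple[str, list[int]]:
        # Zero-shot: no task-specific fine-tuning; the model free-forms a
        # rationale and then commits to action tokens.
        rationale = f"To '{instruction}', move toward the target seen in {observation}."
        action_tokens = [random.randrange(self.vocab_size) for _ in range(7)]
        return rationale, action_tokens


def reward(action_tokens: list[int]) -> float:
    """Placeholder task reward, e.g. success or distance-to-goal from a simulator."""
    return -abs(sum(action_tokens) / len(action_tokens) - 128) / 128


def rl_refine(model: MockMLLM, episodes: int = 8) -> list[tuple[float, list[int]]]:
    """Sample several reasoning-conditioned action sequences and keep the
    highest-reward half -- a crude stand-in for the RL refinement step."""
    samples = []
    for _ in range(episodes):
        _, tokens = model.reason_and_act("table-top camera frame", "pick up the red block")
        samples.append((reward(tokens), tokens))
    samples.sort(key=lambda pair: pair[0], reverse=True)
    return samples[: episodes // 2]  # refined sequences would then supervise the VLA


if __name__ == "__main__":
    best = rl_refine(MockMLLM())
    print(f"kept {len(best)} refined action sequences, best reward {best[0][0]:.3f}")
```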
Categories
Deep Analysis
AI reads the paper in depth and generates an insightful summary