Author count
Tag count
Content status
Original + Chinese
View the title and abstract in both languages on the same page
PDF Preview
Read or download the full paper directly from the detail page
Deep Analysis
Drill down further into the AI-generated structured interpretation
Abstract
This paper presents RoboAlign, a systematic framework for improving vision-language-action models (VLAs) by enhancing embodied reasoning capabilities in multimodal large language models (MLLMs). The key innovation involves sampling action tokens through zero-shot natural language reasoning and refining them using reinforcement learning to improve action accuracy. The approach effectively bridges the modality gap between language understanding and low-level robot actions, facilitating knowledge transfer from multimodal LLMs to embodied agents. Experimental validation demonstrates that training VLAs with this framework leads to reliable performance improvements in robotic manipulation tasks.
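The abstract outlines a two-step recipe: sample action tokens via zero-shot natural-language reasoning from an MLLM, then refine them with reinforcement learning before training the VLA. The toy sketch below only illustrates the shape of that loop; every name in it (MockMLLM, reason_and_act, rl_refine, the reward stub) is an illustrative assumption, not the paper's actual interface or algorithm.

```python
# Minimal sketch of a sample-then-refine loop in the spirit of the abstract.
# All classes, functions, and the reward are hypothetical placeholders.
import random


class MockMLLM:
    """Stand-in for a multimodal LLM that emits a natural-language rationale
    followed by a short sequence of discretized action tokens."""

    def __init__(self, vocab_size: int = 256):
        self.vocab_size = vocab_size

    def reason_and_act(self, observation: str, instruction: str) -> tuple[str, list[int]]:
        # Zero-shot: no task-specific fine-tuning; the model free-forms a
        # rationale and then commits to action tokens.
        rationale = f"To '{instruction}', move toward the target seen in {observation}."
        action_tokens = [random.randrange(self.vocab_size) for _ in range(7)]
        return rationale, action_tokens


def reward(action_tokens: list[int]) -> float:
    """Placeholder task reward, e.g. success or distance-to-goal from a simulator."""
    return -abs(sum(action_tokens) / len(action_tokens) - 128) / 128


def rl_refine(model: MockMLLM, episodes: int = 8) -> list[tuple[float, list[int]]]:
    """Sample several reasoning-conditioned action sequences and keep the
    highest-reward half -- a crude stand-in for the RL refinement step."""
    samples = []
    for _ in range(episodes):
        _, tokens = model.reason_and_act("table-top camera frame", "pick up the red block")
        samples.append((reward(tokens), tokens))
    samples.sort(key=lambda pair: pair[0], reverse=True)
    return samples[: episodes // 2]  # refined sequences would then supervise the VLA


if __name__ == "__main__":
    best = rl_refine(MockMLLM())
    print(f"kept {len(best)} refined action sequences, best reward {best[0][0]:.3f}")
```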
Categories
Deep Analysis
AI reads the paper in depth and generates an insightful summary