Abstract
Vision-Language-Action (VLA) models integrate perception, language understanding, and motor control into unified architectures for robotic control. Using activation injection, sparse autoencoders, and linear probes, this study systematically investigates how VLAs process multimodal inputs to generate actions, across six models (80M-7B parameters) and 394,000+ rollout episodes. The research reveals that the visual pathway is the dominant factor in action generation across all architectures: injecting baseline activations into null-prompt episodes reproduces nearly identical behavior. Cross-task injection experiments demonstrate spatially bound motor programs tied to scene coordinates rather than abstract task representations. Language sensitivity depends on task structure rather than model design: when visual context uniquely specifies the task, language is ignored, but when multiple goals share a scene, language becomes essential (94% accuracy on X-VLA libero_goal). These findings provide mechanistic insights into VLA operation and have implications for embodied AI systems requiring precise vision-language coordination for robotic manipulation.
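To make one of the abstract's methods concrete, a linear probe asks whether some property (here, a binary task label) is linearly decodable from a model's internal activations. The following is a minimal illustrative sketch on synthetic activations, not the paper's actual pipeline; all names, dimensions, and the class-signal construction are assumptions for the example.

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe that predicts a binary
    label from activation vectors via gradient descent."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        z = acts @ w + b
        p = 1.0 / (1.0 + np.exp(-z))   # sigmoid
        grad = p - labels              # dLoss/dz for cross-entropy
        w -= lr * acts.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(acts, labels, w, b):
    """Fraction of examples whose label the probe recovers."""
    preds = (acts @ w + b) > 0
    return (preds == labels).mean()

# Synthetic "activations": a task signal hidden in one dimension.
rng = np.random.default_rng(0)
n, d = 200, 16
labels = rng.integers(0, 2, n)
acts = rng.normal(size=(n, d))
acts[:, 0] += 3.0 * labels  # class-dependent shift in dimension 0

w, b = train_linear_probe(acts, labels)
acc = probe_accuracy(acts, labels, w, b)
```

High probe accuracy is evidence that the probed representation encodes the property linearly; in the paper's setting, probes of this kind are one way to test where and when task or language information is represented inside a VLA.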