Abstract
Vision-Language-Action (VLA) models integrate perception, language understanding, and motor control into unified architectures for robotic control. Using activation injection, sparse autoencoders, and linear probes, this study systematically investigates how VLAs process multimodal inputs to generate actions, across six models (80M-7B parameters) and 394,000+ rollout episodes. The research reveals that the visual pathway is the dominant factor in action generation across all architectures: injecting baseline activations into null-prompt episodes reproduces nearly identical behavior. Cross-task injection experiments demonstrate spatially bound motor programs tied to scene coordinates rather than abstract task representations. Language sensitivity depends on task structure rather than model design: when visual context uniquely specifies the task, language is ignored, but when multiple goals share a scene, language becomes essential (94% accuracy on X-VLA libero_goal). These findings provide mechanistic insights into VLA operation and have implications for embodied AI systems requiring precise vision-language coordination for robotic manipulation.
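To make one of the abstract's methods concrete, a linear probe asks whether some property (here, a binary task label) is linearly decodable from a model's internal activations. The following is a minimal illustrative sketch on synthetic activations, not the paper's actual pipeline; all names, dimensions, and the class-signal construction are assumptions for the example.

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe that predicts a binary
    label from activation vectors via gradient descent."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        z = acts @ w + b
        p = 1.0 / (1.0 + np.exp(-z))   # sigmoid
        grad = p - labels              # dLoss/dz for cross-entropy
        w -= lr * acts.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(acts, labels, w, b):
    """Fraction of examples whose label the probe recovers."""
    preds = (acts @ w + b) > 0
    return (preds == labels).mean()

# Synthetic "activations": a task signal hidden in one dimension.
rng = np.random.default_rng(0)
n, d = 200, 16
labels = rng.integers(0, 2, n)
acts = rng.normal(size=(n, d))
acts[:, 0] += 3.0 * labels  # class-dependent shift in dimension 0

w, b = train_linear_probe(acts, labels)
acc = probe_accuracy(acts, labels, w, b)
```

High probe accuracy is evidence that the probed representation encodes the property linearly; in the paper's setting, probes of this kind are one way to test where and when task or language information is represented inside a VLA.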