Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models
Vision-Language-Action (VLA) models integrate perception, language understanding, and motor control into unified architectures for robotic control. This study systematically investigates how VLAs transform multimodal inputs into actions, applying activation injection, sparse autoencoders, and linear probes to six models (80M–7B parameters) across more than 394,000 rollout episodes. The visual pathway emerges as the dominant driver of action generation in every architecture: injecting baseline activations into null-prompt episodes reproduces near-identical behavior. Cross-task injection experiments further show that the models encode spatially bound motor programs tied to scene coordinates rather than abstract task representations. Language sensitivity depends on task structure rather than model design: when visual context uniquely specifies the task, language is ignored, but when multiple goals share a scene, language becomes essential (94% accuracy for X-VLA on libero_goal). These findings offer mechanistic insight into how VLAs operate and carry implications for embodied AI systems that require precise vision-language coordination for robotic manipulation.
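To make the activation-injection setup concrete, the following is a minimal PyTorch sketch: activations at a chosen layer are cached during a baseline (full-prompt) forward pass, then patched into a null-prompt pass via forward hooks. The `ToyVLA` module, its layer names, and the feature dimensions are hypothetical stand-ins for illustration, not the actual models or hook points used in the study.

```python
import torch
import torch.nn as nn

class ToyVLA(nn.Module):
    """Illustrative stand-in for a VLA policy: fused vision+language features -> action."""
    def __init__(self, d=64, action_dim=7):
        super().__init__()
        self.vision_encoder = nn.Linear(d, d)
        self.language_encoder = nn.Linear(d, d)
        self.fusion = nn.Linear(2 * d, d)       # injection target (assumed hook point)
        self.action_head = nn.Linear(d, action_dim)

    def forward(self, img_feat, lang_feat):
        v = torch.relu(self.vision_encoder(img_feat))
        l = torch.relu(self.language_encoder(lang_feat))
        h = torch.relu(self.fusion(torch.cat([v, l], dim=-1)))
        return self.action_head(h)

model = ToyVLA().eval()
cache = {}

img = torch.randn(1, 64)        # stands in for visual features from an observation
lang_full = torch.randn(1, 64)  # stands in for the real instruction embedding
lang_null = torch.zeros(1, 64)  # stands in for an empty (null) prompt

# 1) Capture: record the fusion-layer activation during the baseline pass.
def cache_hook(module, inputs, output):
    cache["fusion"] = output.detach().clone()

handle = model.fusion.register_forward_hook(cache_hook)
baseline_action = model(img, lang_full)
handle.remove()

# 2) Inject: overwrite the fusion-layer activation during the null-prompt pass.
#    Returning a tensor from a forward hook replaces the module's output.
def inject_hook(module, inputs, output):
    return cache["fusion"]

handle = model.fusion.register_forward_hook(inject_hook)
injected_action = model(img, lang_null)
handle.remove()

# If the injected activations carry the behaviorally relevant signal, the
# null-prompt action should match the baseline action.
print(torch.allclose(baseline_action, injected_action))  # True in this toy case
```

In a real rollout this capture-and-inject cycle would run at every control step, and the hook point (vision tokens, language tokens, or fused features) determines which pathway's contribution is being tested.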