Embodied Intelligence

58 papers in total

cs.RO, End-to-End, CV

Speculative Policy Orchestration: A Latency-Resilient Framework for Cloud-Robotic Manipulation

This paper presents Speculative Policy Orchestration (SPO), a novel latency-resilient framework designed for cloud-robotics manipulation tasks. The proposed approach enables robots to offload computationally intensive motion planning to remote cloud servers while maintaining stable high-frequency control at the edge. By utilizing a cloud-hosted world model to pre-compute and stream kinematic waypoints, the system effectively decouples execution frequency from network round-trip latency. To ensure safe operation, an epsilon-tube verifier bounds kinematic execution errors, preventing unsafe predictive drift. Additionally, an Adaptive Horizon Scaling mechanism dynamically adjusts the speculative pre-fetch depth based on real-time tracking performance. The framework is validated through continuous manipulation experiments on RLBench under emulated network delay conditions.
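The two safety mechanisms named in this summary can be illustrated with a small sketch. It is not the authors' implementation: the Euclidean error measure, the doubling and halving rule, and the names `within_tube`, `next_horizon`, `h_min`, and `h_max` are all assumptions made for illustration.

```python
import math


def within_tube(predicted, measured, epsilon):
    """Return True if the measured configuration lies within an
    epsilon-radius tube (Euclidean norm, an assumption) around the
    speculatively streamed waypoint."""
    return math.dist(predicted, measured) <= epsilon


def next_horizon(horizon, tracking_error, epsilon, h_min=1, h_max=16):
    """Hypothetical adaptive-horizon rule: pre-fetch deeper while tracking
    is comfortably inside the tube, back off once the tube is violated."""
    if tracking_error <= 0.5 * epsilon:
        return min(h_max, horizon * 2)   # tracking is good: speculate further
    if tracking_error > epsilon:
        return max(h_min, horizon // 2)  # tube violated: shorten the pre-fetch
    return horizon                       # in between: keep the current depth


# Example: one control step with a 0.6 rad tube around the waypoint.
ok = within_tube([0.0, 0.0], [0.3, 0.4], epsilon=0.6)
depth = next_horizon(4, tracking_error=0.01, epsilon=0.1)
```

In a real system the verifier would presumably reject or re-plan any waypoint that fails the check; the sketch only shows the bounded-error test and one plausible way to scale the speculative depth.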

Anonymous Authors

about 1 month ago

arXiv 2603.19418v1

cs.RO, Embodied Intelligence

SOFTMAP: Sim2Real Soft Robot Forward Modeling via Topological Mesh Alignment and Physics Prior

While soft robot manipulators offer compelling advantages over rigid counterparts, including inherent compliance, safe human-robot interaction, and the ability to conform to complex geometries, accurate forward modeling from low-dimensional actuation commands remains an open challenge due to nonlinear material phenomena such as hysteresis and manufacturing variability. We present SOFTMAP, a sim-to-real learning framework for real-time 3D forward modeling of tendon-actuated soft finger manipulators. SOFTMAP combines four components: (1) As-Rigid-As-Possible (ARAP)-based topological alignment that projects simulated and real point clouds into a shared, topologically consistent vertex space; (2) a lightweight MLP forward model pretrained on simulation data to map servo commands to full 3D finger geometry; (3) a residual correction network trained on a small set of real observations to predict per-vertex displacement fields that compensate for sim-to-real discrepancies; and (4) a closed-form linear actuation calibration layer enabling real-time inference at 30 FPS. We evaluate SOFTMAP on both simulated and physical hardware, achieving state-of-the-art shape prediction accuracy with a Chamfer distance of 0.389 mm in simulation and 3.786 mm on hardware, millimeter-level fingertip trajectory tracking across multiple target paths, and a 36.5% improvement in teleoperation task success over the baseline. Our results show that SOFTMAP provides a data-efficient approach for 3D forward modeling and control of soft manipulators.
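For readers unfamiliar with the Chamfer distance reported above, the following is a minimal sketch of one common definition: the mean nearest-neighbor distance from each cloud to the other, summed over both directions. Conventions differ (squared distances, averaging instead of summing the two directions), and the summary does not say which variant SOFTMAP uses, so treat this as an assumption.

```python
import math


def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance between two point clouds, given as
    sequences of (x, y, z) tuples. Brute force, O(len(a) * len(b))."""

    def one_way(src, dst):
        # Mean distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(cloud_a, cloud_b) + one_way(cloud_b, cloud_a)


# Example: two tiny clouds that differ in one point.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
observed = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
distance = chamfer_distance(predicted, observed)
```

The result is in the same unit as the input coordinates, so clouds expressed in millimeters give a distance in millimeters.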

about 1 month ago

arXiv 2603.19384v1

cs.CV, End-to-End, CV

VAMPO: Policy Optimization for Improving Visual Dynamics in Video Action Models

This paper presents VAMPO, a post-training framework that improves visual dynamics in video action models for robot control. The key contribution is formulating multi-step denoising as a sequential decision process and optimizing the denoising policy with rewards defined over expert visual dynamics in latent space. The approach addresses the objective mismatch in current diffusion-based video predictors by explicitly optimizing precision-critical visual dynamics needed for manipulation tasks.

about 1 month ago

arXiv 2603.19370v1

cs.CV, CV, Transformer

Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

This paper addresses the spatial blindness problem in Multimodal Large Language Models by leveraging implicit 3D priors learned in video generation models. The authors propose VEGA-3D, a framework that repurposes pre-trained video diffusion models as latent world simulators to extract robust 3D structural priors and physical understanding. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations through token-level adaptive gated fusion, the method enriches MLLMs with dense geometric cues for improved scene understanding and geometric reasoning capabilities.
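Token-level gated fusion can be sketched as follows. This is an illustrative guess at the general pattern, not VEGA-3D's actual module: it assumes one scalar gate per token, equal feature dimensions for the two streams, and hypothetical parameter names `w_sem`, `w_geo`, and `bias`.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(semantic, geometric, w_sem, w_geo, bias=0.0):
    """Fuse per-token semantic and geometric features with a learned gate.

    semantic, geometric: lists of tokens, each a list of floats of equal length.
    w_sem, w_geo: gate weight vectors (hypothetical learned parameters).
    """
    fused = []
    for s, g in zip(semantic, geometric):
        # The gate depends on both streams, so it adapts token by token.
        logit = (sum(w * x for w, x in zip(w_sem, s))
                 + sum(w * x for w, x in zip(w_geo, g))
                 + bias)
        gate = sigmoid(logit)
        # Convex combination: gate -> 1 keeps semantics, gate -> 0 keeps geometry.
        fused.append([gate * si + (1.0 - gate) * gi for si, gi in zip(s, g)])
    return fused


# Example: with zero gate weights the gate is 0.5 and the streams are averaged.
tokens = gated_fusion([[1.0, 3.0]], [[3.0, 1.0]], w_sem=[0.0, 0.0], w_geo=[0.0, 0.0])
```

A per-channel gate (one gate value per feature dimension) is an equally plausible design; the summary does not specify which is used.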

VEGA-3D Authors

about 1 month ago

arXiv 2603.19235v1

cs.CV, CV, Transformer

Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models

Vision-Language-Action (VLA) models integrate perception, language understanding, and motor control into unified architectures for robotic control. This study systematically investigates how VLAs process multimodal inputs to generate actions through activation injection, sparse autoencoders, and linear probes across six models (80M-7B parameters) using 394,000+ rollout episodes. The research reveals that the visual pathway is the dominant factor in action generation across all architectures, as injecting baseline activations into null-prompt episodes reproduces nearly identical behavior. Cross-task injection experiments demonstrate spatially bound motor programs tied to scene coordinates rather than abstract task representations. Language sensitivity is found to depend on task structure rather than model design: when visual context uniquely specifies the task, language is ignored, but when multiple goals share a scene, language becomes essential (achieving 94% accuracy in X-VLA libero_goal). These findings provide mechanistic insights into VLA operation and have implications for embodied AI systems requiring precise vision-language coordination for robotic manipulation.

about 1 month ago

arXiv 2603.19233v1

cs.CV, End-to-End, CV

MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction

This paper presents MonoArt, a unified framework for reconstructing articulated 3D objects from single images through progressive structural reasoning. The method addresses the challenge of inferring object geometry, part structure, and motion parameters from limited visual evidence without direct articulation regression. By progressively transforming visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture, the framework enables stable and interpretable articulation inference without external templates or multi-stage pipelines. The approach is validated on articulated object datasets demonstrating effective 3D reconstruction of objects with movable parts.

about 1 month ago

arXiv 2603.19231v1

cs.CV, CV, Transformer

OmniVTA: Visuo-Tactile World Modeling for Contact-Rich Robotic Manipulation

This paper presents OmniVTA, a world-model-based visuo-tactile manipulation framework designed for contact-rich robotic manipulation tasks such as wiping and assembly. The work introduces OmniViTac, a large-scale dataset comprising 21,000+ trajectories across 86 tasks and 100+ objects with six physics-grounded interaction patterns. The framework integrates four tightly coupled modules including a self-supervised tactile encoder and a two-stream visuo-tactile world model for predicting contact dynamics. The research addresses limitations in existing methods by treating tactile signals actively to model contact dynamics and enable explicit closed-loop control, moving beyond passive observation approaches.

OmniVTA Authors

about 1 month ago

arXiv 2603.19201v2

cs.CV, CV, Transformer

Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models

Vision-Language-Action (VLA) models have emerged as a promising approach for general-purpose robot manipulation, combining visual perception, language understanding, and action planning in a unified framework. This work applies mechanistic interpretability techniques using Sparse Autoencoders (SAEs) to analyze hidden layer activations in VLA models, revealing sparse dictionary features that provide interpretable bases for model computation. The research discovers that most SAE features correspond to memorized sequences from training demonstrations, while some features represent interpretable, generalizable motion primitives and semantic properties. This analysis offers insights into VLA model generalizability and provides a framework for steering model behavior through identified interpretable features, advancing the understanding of embodied AI systems for robot manipulation tasks.

Anonymous Authors

about 1 month ago

arXiv 2603.19183v1

cs.RO, Embodied Intelligence

Tendon-Actuated Robots with a Tapered, Flexible Polymer Backbone: Design, Fabrication, and Modeling

This paper presents the design, modeling, and fabrication of 3D-printed, tendon-actuated continuum robots featuring a flexible, tapered backbone constructed from thermoplastic polyurethane (TPU). Our scalable design incorporates an integrated electronics base housing that enables direct tendon tension control and sensing via actuators and compression load cells. Unlike many continuum robots that are single-purpose and costly, the proposed design prioritizes customizability, rapid assembly, and low cost while enabling high curvature and enhanced distal compliance through geometric tapering, thereby supporting a broad range of compliant robotic inspection and manipulation tasks. We develop a generalized forward kinetostatic model of the tapered backbone based on Cosserat rod theory using a Newtonian approach, extending existing tendon-actuated Cosserat rod formulations to explicitly account for spatially varying backbone cross-sectional geometry. The model captures the graded stiffness profile induced by the tapering and enables systematic exploration of the configuration space as a function of the geometric design parameters. Specifically, we analyze how the backbone taper angle influences the robot's configuration space and manipulability. The model is validated against motion capture data, achieving centimeter-level shape prediction accuracy after calibrating Young's modulus via a line search that minimizes modeling error. We further demonstrate teleoperated grasping using an endoscopic gripper routed along the continuum robot, mounted on a 6-DoF robotic arm. Parameterized iLogic/CAD scripts are provided for rapid geometry generation and scaling. The presented framework establishes a simple, rapid, and reproducible pathway from parametric design to controlled tendon actuation for tapered, tendon-driven continuum robots manufactured using fused deposition modeling 3D printers.

Harald Minde Hansen +5

about 1 month ago

arXiv 2603.19124v1

cs.RO, Embodied Intelligence

Articulated-Body Dynamics Network: Dynamics-Grounded Prior for Robot Learning

Recent work in reinforcement learning has shown that incorporating structural priors for articulated robots, such as link connectivity, into policy networks improves learning efficiency. However, dynamics properties, despite their fundamental role in determining how forces and motion propagate through the body, remain largely underexplored as an inductive bias for policy learning. To address this gap, we present the Articulated-Body Dynamics Network (ABD-Net), a novel graph neural network architecture grounded in the computational structure of forward dynamics. Specifically, we adapt the inertia propagation mechanism from the Articulated Body Algorithm, systematically aggregating inertial quantities from child to parent links in a tree-structured manner, while replacing physical quantities with learnable parameters. Embedding ABD-Net into the policy actor enables dynamics-informed representations that capture how actions propagate through the body, leading to efficient and robust policy learning. Through experiments with simulated humanoid, quadruped, and hopper robots, our approach demonstrates increased sample efficiency and generalization to dynamics shifts compared to transformer-based and GNN baselines. We further validate the learned policy on real Unitree G1 and Go2 robots, state-of-the-art humanoid and quadruped platforms, generating dynamic, versatile and robust locomotion behaviors through sim-to-real transfer with real-time inference.
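The child-to-parent aggregation pattern described in the abstract can be sketched in a heavily simplified form. The real Articulated Body Algorithm propagates 6x6 spatial inertias through joint-dependent transforms; the sketch below replaces them with one scalar per link and one scalar weight per joint (standing in for the learnable parameters), and the function name `propagate_to_root` is hypothetical.

```python
def propagate_to_root(parent, own, weight):
    """Aggregate per-link quantities from the leaves toward the root.

    parent[i]: index of link i's parent, with parent[0] == -1 for the root.
               Assumes parent[i] < i, i.e. links are listed parents first.
    own[i]:    scalar feature of link i (stand-in for a spatial inertia).
    weight[i]: scalar applied when link i's aggregate is passed to its
               parent (stand-in for a learnable parameter); weight[0] is unused.
    """
    aggregated = list(own)
    # Visiting links in reverse order guarantees every child is complete
    # before its contribution is added to the parent.
    for i in range(len(parent) - 1, 0, -1):
        aggregated[parent[i]] += weight[i] * aggregated[i]
    return aggregated


# Example: a three-link chain root -> link 1 -> link 2 with unit weights.
chain = propagate_to_root(parent=[-1, 0, 1], own=[1, 2, 3], weight=[0, 1, 1])
```

Each entry of the result summarizes the subtree rooted at that link, which is the structural property the abstract attributes to the inertia propagation pass.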

Sangwoo Shin +3

about 1 month ago

arXiv 2603.19078v2

cs.CV, End-to-End, CV

ATG-MoE: Autoregressive trajectory generation with mixture-of-experts for assembly skill learning

This paper presents ATG-MoE, an end-to-end autoregressive trajectory generation method with mixture-of-experts architecture for robot assembly skill learning from demonstration. The method processes multi-modal inputs including RGB-D observations, natural language instructions, and robot proprioception to generate manipulation trajectories in a closed-loop manner. It incorporates multi-modal feature fusion for comprehensive scene and task understanding, autoregressive sequence modeling for temporally coherent trajectory generation, and a mixture-of-experts architecture enabling unified multi-skill learning. The approach addresses challenges in flexible manufacturing where robot systems must adapt to changing tasks, objects, and environments without labor-intensive traditional programming.

ATG-MoE Authors

about 1 month ago

arXiv 2603.19029v1

cs.RO, Embodied Intelligence

PRIOR: Perceptive Learning for Humanoid Locomotion with Reference Gait Priors

Training perceptive humanoid locomotion policies that traverse complex terrains with natural gaits remains an open challenge, typically demanding multi-stage training pipelines, adversarial objectives, or extensive real-world calibration. We present PRIOR, an efficient and reproducible framework built on Isaac Lab that achieves robust terrain traversal with human-like gaits through a simple yet effective design: (i) a parametric gait generator that supplies stable reference trajectories derived from motion capture without adversarial training, (ii) a GRU-based state estimator that infers terrain geometry directly from egocentric depth images via self-supervised heightmap reconstruction, and (iii) terrain-adaptive footstep rewards that guide foot placement toward traversable regions. Through systematic analysis of depth image resolution trade-offs, we identify configurations that maximize terrain fidelity under real-time constraints, substantially reducing perceptual overhead without degrading traversal performance. Comprehensive experiments across terrains of varying difficulty, including stairs, boxes, and gaps, demonstrate that each component yields complementary and essential performance gains, with the full framework achieving a 100% traversal success rate. We will open-source the complete PRIOR framework, including the training pipeline, parametric gait generator, and evaluation benchmarks, to serve as a reproducible foundation for humanoid locomotion research on Isaac Lab.

about 1 month ago

arXiv 2603.18979v1

cs.CV, CV, 3D Detection

GHOST: Fast Category-agnostic Hand-Object Interaction Reconstruction from RGB Videos using Gaussian Splatting

This paper presents GHOST (Gaussian Hand-Object Splatting), a fast category-agnostic framework for reconstructing dynamic hand-object interactions from monocular RGB videos. The method represents both hands and objects as dense, view-consistent Gaussian discs to achieve complete 3D reconstructions. Three key innovations are introduced: a geometric-prior retrieval and consistency loss for completing occluded object regions, grasp-aware alignment for refining hand translations and object scale to ensure realistic contact, and a hand-aware background loss that prevents penalizing hand-occluded object regions. The framework enables physically consistent and animatable reconstructions while running an order of magnitude faster than existing methods, with applications in AR/VR, robotics, and embodied AI.

Anonymous Authors

about 1 month ago

arXiv 2603.18912v1

cs.CV, CV, Transformer

MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model

Spatial reasoning is foundational for Vision-Language Models deployed as Vision-Language-Action agents in physical environments. This paper introduces MultihopSpatial, a benchmark designed for multi-hop and compositional spatial reasoning with complex queries across diverse spatial perspectives. The work proposes Acc@50IoU, a metric that simultaneously evaluates reasoning and visual grounding by requiring both answer selection and precise bounding box prediction. MultihopSpatial-Train provides a large-scale training corpus to foster spatial intelligence. Extensive evaluation of 37 state-of-the-art VLMs yields key insights into the capabilities and limitations of current models for robust VLA deployment in embodied AI scenarios.
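A joint metric of this kind can be sketched as below. The summary only says that Acc@50IoU requires both a correct answer and a precise box; reading "50IoU" as an IoU threshold of 0.5, the `(x1, y1, x2, y2)` box format, and the sample field names are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(samples, threshold=0.5):
    """Fraction of samples with a correct answer AND a box whose IoU with
    the ground truth reaches the threshold. Field names are hypothetical."""
    hits = sum(
        1
        for s in samples
        if s["pred_answer"] == s["gt_answer"]
        and iou(s["pred_box"], s["gt_box"]) >= threshold
    )
    return hits / len(samples)


# Example: right answer + good box, right answer + poor box, wrong answer + good box.
samples = [
    {"pred_answer": "A", "gt_answer": "A", "pred_box": (0, 0, 2, 2), "gt_box": (0, 0, 2, 2)},
    {"pred_answer": "B", "gt_answer": "B", "pred_box": (0, 0, 2, 2), "gt_box": (1, 0, 3, 2)},
    {"pred_answer": "C", "gt_answer": "D", "pred_box": (0, 0, 2, 2), "gt_box": (0, 0, 2, 2)},
]
score = acc_at_iou(samples)
```

Only the first sample counts as a hit, which shows why such a metric is stricter than answer accuracy alone: a model cannot score by guessing the answer without grounding it.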

MultihopSpatial Team

about 1 month ago

arXiv 2603.18892v1

cs.CV, CV, 3D Detection

V-Dreamer: Automating Robotic Simulation and Trajectory Synthesis via Video Generation Priors

This paper presents V-Dreamer, a fully automated framework for generating simulation-ready manipulation environments and executable robot trajectories from natural language instructions. The system leverages large language models and 3D generative models to construct physically grounded 3D scenes validated by geometric constraints for stable, collision-free layouts. Video generation models serve as rich motion priors for behavior synthesis, which are then mapped to executable robot trajectories through a Sim-to-Gen visual-kinematic alignment module using CoTracker3 and VGGT. This approach enables high visual diversity and physical fidelity for robotic manipulation training.

about 1 month ago

arXiv 2603.18811v1

cs.CV, CV, Object Detection

Robotic Agentic Platform for Intelligent Electric Vehicle Disassembly

This paper presents RAPID, a Robotic Agentic Platform for Intelligent Disassembly designed to address the challenge of scalable EV battery recycling. The system features a gantry-mounted industrial manipulator with RGB-D perception capabilities and an automated nut-running tool for fastener removal on full-scale EV battery packs. An open-vocabulary object detection pipeline achieves 0.9757 mAP50 for reliable identification of screws, nuts, busbars, and components. The research experimentally evaluates three one-shot fastener removal strategies: taught-in poses (97% success, 24 min), one-shot vision execution (57%, 29 min), and visual servoing (83%, 36 min), providing a comprehensive comparison of success rates and disassembly times for battery top cover fasteners.

RAPID Research Team

about 1 month ago

arXiv 2603.18520v1

cs.RO, End-to-End, Transformer

RoboForge: Physically Optimized Text-guided Whole-Body Locomotion for Humanoids

This paper presents RoboForge, a unified latent-driven framework for text-guided whole-body humanoid locomotion. The approach bridges natural language and physical robot execution through a retarget-free pipeline that couples motion generation and control bidirectionally. A Physical Plausibility Optimization module serves as the coupling interface, refining teacher-student distillation policies with plausibility-centric rewards to eliminate physical artifacts like floating, skating, and penetration. This work enables humanoid robots to execute natural language-directed motions with improved physical feasibility.

RoboForge Authors

about 1 month ago

arXiv 2603.17927v2

cs.CV, CV, 3D Detection

MGSO: Monocular Real-time Photometric SLAM with Efficient 3D Gaussian Splatting

This paper presents MGSO (Monocular Gaussian Splatting Optimization), a novel real-time SLAM system that integrates photometric SLAM with efficient 3D Gaussian Splatting for dense 3D reconstruction. The proposed approach leverages photometric SLAM to generate dense structured point clouds that accelerate 3D Gaussian initialization and optimization. By producing more efficient maps with fewer Gaussians while maintaining reconstruction quality, the system achieves an excellent balance between quality, memory efficiency, and speed. Experiments demonstrate that MGSO outperforms state-of-the-art 3DGS-based SLAM systems, making it particularly suitable for real-time dense mapping on resource-limited devices.

over 1 year ago

arXiv 2409.13055v3
