Embodied Intelligence

58 papers in total

cs.CV, CV, Transformer

ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment

ABot-PhysWorld is a 14B Diffusion Transformer model designed for interactive world modeling in robotics that generates visually realistic, physically plausible, and action-controllable videos. The model addresses common physical implausibility issues like object penetration and anti-gravity motion by using a novel DPO-based post-training framework with decoupled discriminators trained on a curated dataset of three million physics-aware manipulation clips. A parallel context block enables precise spatial action injection for cross-embodiment robot control. To evaluate generalization, the system introduces EZSbench, the first training-independent embodied zero-shot benchmark combining real and synthetic unseen robot-task-scene combinations.

ABot Team

26 days ago

arXiv 2603.23376v1

cs.CV, CV, Transformer

Grounding Sim-to-Real Generalization in Dexterous Manipulation: An Empirical Study with Vision-Language-Action Models

This paper investigates Sim-to-Real generalization for dexterous manipulation tasks using Vision-Language-Action (VLA) models. The study empirically examines key factors affecting transfer from simulation to real-world deployment, including multi-level domain randomization, photorealistic rendering, physics-realistic modeling, and reinforcement learning updates. A comprehensive evaluation protocol is designed to quantify real-world manipulation performance, providing insights for developing generalist robot control policies that can effectively bridge the simulation-to-reality gap in dexterous manipulation scenarios.

Anonymous Authors

26 days ago

arXiv 2603.22876v1

cs.CV, CV, Transformer

DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models

Vision-Language-Action (VLA) models enable robots to map visual observations and language instructions directly to robotic actions. However, existing VLA models struggle with complex multi-step tasks that require logical planning and precise manipulation. Current Chain-of-Thought (CoT) approaches have difficulty capturing low-level visual details and high-level logical planning at the same time, and they suffer from high inference latency and compounding errors. This paper proposes DualCoT-VLA, a visual-linguistic CoT method with a parallel reasoning mechanism that integrates visual CoT for comprehensive multi-modal reasoning, enabling robots to reason before acting in manipulation tasks.

DualCoT-VLA Authors

27 days ago

arXiv 2603.22280v1

cs.CV, CV, 3D Detection

UniDex: A Robot Foundation Suite for Universal Dexterous Hand Control from Egocentric Human Videos

This paper presents UniDex, a robot foundation suite addressing the challenges of dexterous manipulation by combining a large-scale robot-centric dataset with a unified vision-language-action (VLA) policy. The system transforms egocentric human videos into robot-executable trajectories across eight dexterous hand embodiments through a human-in-the-loop retargeting procedure. By operating on explicit 3D pointclouds with human hands masked, the approach narrows kinematic and visual gaps between human and robot domains. The introduced Function-Actuator-Aligned Space (FAAS) provides a unified action space for universal dexterous hand control.

UniDex Team

27 days ago

arXiv 2603.22264v1

cs.RO, End-to-End, CV

DexDrummer: In-Hand, Contact-Rich, and Long-Horizon Dexterous Robot Drumming

This paper addresses the challenging problem of in-hand, contact-rich, and long-horizon dexterous robot manipulation by proposing drumming as a comprehensive testbed. The DexDrummer framework employs a hierarchical object-centric bimanual policy that combines trajectory planning with residual reinforcement learning corrections, enabling effective sim-to-real transfer. The approach specifically targets dexterous manipulation skills including in-hand control for drumstick stabilization, contact-rich striking interactions, and long-horizon rhythmic coordination across multiple drums. By integrating these three challenging aspects into a single complex task, this work advances the field of robotic dexterity and manipulation planning.

DexDrummer Team

27 days ago

arXiv 2603.22263v1

cs.RO, End-to-End, CV

Make Tracking Easy: Neural Motion Retargeting for Humanoid Whole-body Control

This paper addresses the challenge of transferring human motion data to humanoid robots by proposing Neural Motion Retargeting (NMR), a novel framework that transforms static geometric mapping into a dynamics-aware learned process. The approach uses Clustered-Expert Physics Refinement (CEPR) with VAE-based motion clustering to group heterogeneous movements into latent motifs, significantly reducing computational overhead for the reinforcement learning experts that project and repair noisy human motion data. Through Hessian analysis, the authors demonstrate that traditional optimization-based retargeting is inherently non-convex and prone to local optima, leading to physical artifacts. By reformulating the problem as learning a data distribution rather than optimizing individual solutions, the framework achieves smooth, physically plausible whole-body robot control.

NMR Research Team

27 days ago

arXiv 2603.22201v1

cs.RO, End-to-End, CV

ROBOGATE: Adaptive Failure Discovery for Safe Robot Policy Deployment via Two-Stage Boundary-Focused Sampling

This paper presents ROBOGATE, a deployment risk management framework for safe robot policy deployment in industrial settings. The framework combines physics-based simulation with a two-stage adaptive sampling strategy to efficiently discover failure boundaries in high-dimensional operational parameter spaces. Stage 1 uses Latin Hypercube Sampling across an 8-dimensional parameter space, while Stage 2 applies boundary-focused sampling in the 30-70% success rate transition zone. The framework is evaluated in NVIDIA Isaac Sim with Newton physics on Franka Panda and UR5e robots performing pick-and-place tasks across 30,000 experiments, and it uses logistic regression for risk modeling to support safe deployment of robot manipulation policies.

Robogate Research Team

27 days ago

arXiv 2603.22126v1
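
The two-stage sampling idea in the ROBOGATE summary can be illustrated with a small sketch. This is not the authors' implementation: the simulator is replaced by a hypothetical toy function (`toy_success_probability`), Stage 2 is approximated by keeping uniformly drawn candidates whose predicted success falls in the 30-70% band, and using the logistic model to guide Stage 2 is an assumption made for illustration. All function names and sample counts are hypothetical.

```python
import math
import random


def latin_hypercube(n_samples, n_dims, rng):
    """Sample points in [0, 1)^n_dims with exactly one point per stratum on every axis."""
    columns = []
    for _ in range(n_dims):
        # One jittered point inside each of n_samples equal-width strata ...
        column = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # ... then shuffle so strata are paired randomly across axes.
        rng.shuffle(column)
        columns.append(column)
    return [[columns[d][i] for d in range(n_dims)] for i in range(n_samples)]


def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def toy_success_probability(x):
    """Hypothetical stand-in for a physics simulator (assumption, not from the paper)."""
    return sigmoid(10.0 * (0.5 - sum(x) / len(x)))


def predict(weights, bias, x):
    """Predicted success probability under the logistic model."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(points, labels, steps=300, lr=0.5):
    """Logistic regression fitted with plain batch gradient descent."""
    n_dims = len(points[0])
    weights, bias = [0.0] * n_dims, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * n_dims, 0.0
        for x, y in zip(points, labels):
            error = predict(weights, bias, x) - y
            grad_b += error
            for d in range(n_dims):
                grad_w[d] += error * x[d]
        bias -= lr * grad_b / len(points)
        weights = [w - lr * g / len(points) for w, g in zip(weights, grad_w)]
    return weights, bias


def boundary_focused_samples(weights, bias, n_candidates, n_dims, rng, low=0.3, high=0.7):
    """Keep uniform candidates whose predicted success lies in the transition zone."""
    candidates = [[rng.random() for _ in range(n_dims)] for _ in range(n_candidates)]
    return [x for x in candidates if low <= predict(weights, bias, x) <= high]


def run_two_stage(seed=0, n_stage1=200, n_candidates=1000, n_dims=8):
    rng = random.Random(seed)
    # Stage 1: space-filling exploration of the 8-dimensional parameter space.
    stage1 = latin_hypercube(n_stage1, n_dims, rng)
    labels = [1 if rng.random() < toy_success_probability(x) else 0 for x in stage1]
    weights, bias = fit_logistic(stage1, labels)
    # Stage 2: concentrate new samples near the estimated failure boundary.
    stage2 = boundary_focused_samples(weights, bias, n_candidates, n_dims, rng)
    return stage1, stage2, (weights, bias)
```

Calling `run_two_stage()` returns the Stage 1 design, the Stage 2 points that fall in the predicted transition zone, and the fitted model. In a real pipeline each Stage 2 point would be sent back to the simulator for a new trial.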

cs.RO, End-to-End, Embodied Intelligence

Sim-to-Real of Humanoid Locomotion Policies via Joint Torque Space Perturbation Injection

This paper presents a novel sim-to-real approach for training humanoid robot locomotion policies by injecting state-dependent perturbations into joint torque space during simulation. Unlike traditional domain randomization methods that randomize fixed parameters, the proposed approach uses neural networks to generate complex, state-dependent perturbations that simulate nonlinear actuator dynamics and contact compliance. The method achieves superior robustness against unseen reality gaps, demonstrating successful transfer from simulation to real-world humanoid deployment without requiring additional training. Experimental validation confirms that policies trained with this perturbation injection technique can handle complex real-world scenarios that standard randomization cannot capture.

Anonymous Authors

27 days ago

arXiv 2603.21853v1

cs.AI, Transformer, Embodied Intelligence

The Presupposition Problem in Representation Genesis

Large language models represent the first systems to achieve high cognitive performance without clearly undergoing representation genesis - the transition from non-representing physical systems to content-sensitive behavior-guiding states. This paper examines the genesis question in LLMs, investigating which cognitive capacities are affected if genesis did not occur. The authors argue that major frameworks in philosophy of mind, including the Language of Thought hypothesis, teleosemantics, predictive processing, enactivism, and genetic phenomenology, share a common structural feature when addressing the genesis question, lacking the conceptual resources to fully explain LLM representation. The paper provides a theoretical analysis of how contemporary AI systems challenge traditional assumptions about representation and cognition.

Unknown

27 days ago

arXiv 2603.21745v1

cs.CV, CV, Transformer

Efficient Zero-Shot AI-Generated Image Detection

This paper addresses the critical challenge of detecting AI-generated images produced by text-to-image models. The proposed method is training-free and measures representation sensitivity to structured frequency perturbations, enabling detection of subtle manipulations between real and synthetic images. The approach uses only a single Fourier transform for perturbation generation, making it computationally lightweight and achieving one to two orders of magnitude faster inference than existing training-free detectors. Extensive experiments on challenging benchmarks including the OpenFake benchmark demonstrate superior performance over state-of-the-art methods.

Anonymous

27 days ago

arXiv 2603.21619v1

cs.CV, Embodied Intelligence

AdaEdit: Adaptive Temporal and Channel Modulation for Flow-Based Image Editing

Inversion-based image editing in flow matching models has emerged as a powerful paradigm for training-free, text-guided image manipulation. A central challenge in this paradigm is the injection dilemma: injecting source features during denoising preserves the background of the original image but simultaneously suppresses the model's ability to synthesize edited content. Existing methods address this with fixed injection strategies -- binary on/off temporal schedules, uniform spatial mixing ratios, and channel-agnostic latent perturbation -- that ignore the inherently heterogeneous nature of injection demand across both the temporal and channel dimensions. In this paper, we present AdaEdit, a training-free adaptive editing framework that resolves this dilemma through two complementary innovations. First, we propose a Progressive Injection Schedule that replaces hard binary cutoffs with continuous decay functions (sigmoid, cosine, or linear), enabling a smooth transition from source-feature preservation to target-feature generation and eliminating feature discontinuity artifacts. Second, we introduce Channel-Selective Latent Perturbation, which estimates per-channel importance based on the distributional gap between the inverted and random latents and applies differentiated perturbation strengths accordingly -- strongly perturbing edit-relevant channels while preserving structure-encoding channels. Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing types) demonstrate that AdaEdit achieves an 8.7% reduction in LPIPS, a 2.6% improvement in SSIM, and a 2.3% improvement in PSNR over strong baselines, while maintaining competitive CLIP similarity. AdaEdit is fully plug-and-play and compatible with multiple ODE solvers including Euler, RF-Solver, and FireFlow. Code is available at https://github.com/leeguandong/AdaEdit

Guandong Li +1

27 days ago

arXiv 2603.21615v1
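
The Progressive Injection Schedule described in the AdaEdit summary replaces a hard on/off cutoff with a continuous decay. The sketch below shows what such schedules might look like. The exact parameterization used by AdaEdit is not given in the summary, so the function names, the `cutoff` and `sharpness` parameters, and the simple linear blend of source and target features are all assumptions made for illustration.

```python
import math


def injection_weight(progress, kind="cosine", cutoff=0.5, sharpness=10.0):
    """Weight on source features at a normalized denoising progress in [0, 1].

    1.0 means "fully preserve the source", 0.0 means "fully generate the target".
    The parameter names and formulas here are hypothetical, not taken from AdaEdit.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    if kind == "binary":
        # The hard cutoff that the continuous schedules are meant to replace.
        return 1.0 if progress < cutoff else 0.0
    if kind == "linear":
        return 1.0 - progress
    if kind == "cosine":
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    if kind == "sigmoid":
        return 1.0 / (1.0 + math.exp(sharpness * (progress - cutoff)))
    raise ValueError(f"unknown schedule: {kind}")


def blend_features(source, target, weight):
    """Convex combination of source and target feature vectors."""
    return [weight * s + (1.0 - weight) * t for s, t in zip(source, target)]


def schedule(kind, n_steps):
    """Injection weights for n_steps evenly spaced denoising steps (n_steps >= 2)."""
    return [injection_weight(i / (n_steps - 1), kind) for i in range(n_steps)]
```

For example, `schedule("cosine", 5)` yields weights that fall smoothly from 1 to 0, whereas `schedule("binary", 5)` jumps from 1 to 0 at the cutoff, which is the kind of discontinuity the summary attributes to fixed strategies.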

cs.CV, CV, Transformer

VIGIL: Part-Grounded Structured Reasoning for Generalizable Deepfake Detection

This paper presents VIGIL, a novel part-centric structured forensic framework for deepfake detection using multimodal large language models. The approach employs a plan-then-examine pipeline where the model first plans which facial parts warrant inspection based on global visual cues, then examines each part with independently sourced forensic evidence. A stage-gated injection mechanism delivers part-level forensic evidence only during examination to ensure unbiased part selection. The framework is inspired by expert forensic practice and aims to improve the reliability of deepfake detection by separating evidence generation from manipulation localization, addressing the issue of hallucinated explanations in current MLLM-based methods.

VIGIL Authors

27 days ago

arXiv 2603.21526v1

cs.CV, CV, Transformer

RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models

This paper presents RoboAlign, a systematic framework for improving vision-language-action models (VLAs) by enhancing embodied reasoning capabilities in multimodal large language models (MLLMs). The key innovation involves sampling action tokens through zero-shot natural language reasoning and refining them using reinforcement learning to improve action accuracy. The approach effectively bridges the modality gap between language understanding and low-level robot actions, facilitating knowledge transfer from multimodal LLMs to embodied agents. Experimental validation demonstrates that training VLAs with this framework leads to reliable performance improvements in robotic manipulation tasks.

RoboAlign Authors

28 days ago

arXiv 2603.21341v1

cs.RO, End-to-End, Transformer

Evaluating Factor-Wise Auxiliary Dynamics Supervision for Latent Structure and Robustness in Simulated Humanoid Locomotion

This paper evaluates whether factor-wise auxiliary dynamics supervision produces useful latent structure or improved robustness in simulated humanoid locomotion. We introduce DynaMITE, a transformer encoder with a factored 24-dimensional latent space trained using per-factor auxiliary losses during proximal policy optimization. Our method is compared against LSTM, plain Transformer, and MLP baselines on a Unitree G1 humanoid robot across four Isaac Lab tasks. Through comprehensive ablation studies with 10 random seeds, we analyze the contributions of tanh bottlenecks and auxiliary losses to in-distribution reward performance. Results demonstrate that the supervised latent fails to produce decodable or functionally separable factor structure, with probe R-squared near zero and minimal reward changes when subspaces are clamped.

Anonymous Authors

28 days ago

arXiv 2603.21268v1

cs.CV, End-to-End, CV

GAPG: Geometry Aware Push-Grasping Synergy for Goal-Oriented Manipulation in Clutter

Grasping target objects is a fundamental skill for robotic manipulation, but in cluttered environments with stacked or occluded objects, a single-step grasp is often insufficient. This paper proposes a geometry-aware push-grasp synergy framework that leverages point cloud data to integrate grasp and push evaluation. The grasp evaluation module analyzes the geometric relationship between the gripper's point cloud and the points enclosed within its closing region to determine grasp feasibility and stability. Guided by this analysis, the push evaluation module predicts how pushing actions influence future graspable space, enabling the robot to effectively manipulate objects in complex cluttered scenarios.

GAPG Authors

28 days ago

arXiv 2603.21195v1

cs.RO, Embodied Intelligence

Affordance-Guided Enveloping Grasp Demonstration Toward Non-destructive Disassembly of Pinch-Infeasible Mating Parts

Robotic disassembly of complex mating components often renders pinch grasping infeasible, necessitating multi-fingered enveloping grasps. However, visual occlusions and geometric constraints complicate teaching appropriate grasp motions when relying solely on 2D camera feeds. To address this, we propose an affordance-guided teleoperation method that pre-generates enveloping grasp candidates via physics simulation. These Affordance Templates (ATs) are visualized with a color gradient reflecting grasp quality to augment operator perception. Simulations demonstrate the method's generality across various components. Real-robot experiments validate that AT-based visual augmentation enables operators to effectively select and teach enveloping grasp strategies for real-world disassembly, even under severe visual and geometric constraints.

Masaki Tsutsumi +3

28 days ago

arXiv 2603.21143v1

cs.LG, Embodied Intelligence

Harmful Visual Content Manipulation Matters in Misinformation Detection Under Multimedia Scenarios

The widespread dissemination of misinformation across social media platforms has had severe negative effects on society. To address this challenge, the automatic detection of misinformation, particularly in multimedia scenarios, has gained significant attention from both academic and industrial communities, leading to the emergence of a research task known as Multimodal Misinformation Detection (MMD). Current MMD approaches typically focus on capturing the semantic relationships and inconsistencies between modalities but often overlook certain critical indicators within multimodal content. Recent research has shown that manipulated features within the visual content of social media articles serve as valuable clues for MMD. We further argue that the intention behind the manipulation, e.g., harmful or harmless, also matters in MMD. Therefore, in this study, we aim to identify multimodal misinformation by capturing two types of features: manipulation features, which indicate whether visual content has been manipulated, and intention features, which assess the nature of these manipulations, distinguishing between harmful and harmless intentions. Unfortunately, the manipulation and intention labels needed to supervise these features are unavailable. To address this, we introduce two weakly supervised indicators as substitutes by incorporating supplementary datasets focused on image manipulation detection and framing the two classification tasks as positive-unlabeled learning problems. With this framework, we introduce an MMD approach titled Harmful Visual Content Manipulation Matters in MMD (HAVC-M4D). Comprehensive experiments conducted on four prevalent MMD datasets indicate that HAVC-M4D significantly and consistently enhances the performance of existing MMD methods.

Bing Wang +6

28 days ago

arXiv 2603.21054v1

cs.CV, CV, Transformer

Cortical Policy: A Dual-Stream View Transformer for Robotic Manipulation

View transformers process multi-view observations to predict actions and have shown impressive performance in robotic manipulation. Existing methods typically extract static visual representations in a view-specific manner, leading to inadequate 3D spatial reasoning ability and a lack of dynamic adaptation. Taking inspiration from how the human brain integrates static and dynamic views, we propose Cortical Policy, a novel dual-stream view transformer for robotic manipulation that jointly reasons from static-view and dynamic-view streams. The static-view stream enhances spatial understanding by aligning features of geometrically consistent keypoints extracted from a pretrained 3D foundation model. The dynamic-view stream achieves adaptive adjustment through position-aware pretraining of an egocentric gaze estimation model, computationally replicating the human cortical dorsal pathway. The complementary view representations enable improved robotic manipulation performance.

Cortical Policy Authors

28 days ago

arXiv 2603.21051v1

cs.CV, CV, 3D Detection

Geometrically Plausible Object Pose Refinement using Differentiable Simulation

This paper addresses the challenge of geometrically infeasible pose hypotheses in object pose estimation, particularly for dexterous manipulation scenarios. The authors propose a multi-modal approach combining differentiable physics simulation, differentiable rendering, and visuo-tactile sensing to refine object poses while ensuring physical consistency. The method significantly reduces intersection volume errors between objects and robotic hands by 73% under accurate initial estimates and over 87% under high uncertainty, outperforming ICP-based approaches. By integrating physical constraints into pose optimization, this work enables robots to achieve more reliable manipulation by ensuring estimated object poses respect both geometric accuracy and physical reality.

Anonymous Authors

28 days ago

arXiv 2603.20992v1

cs.AI, End-to-End, Transformer

Detection of adversarial intent in Human-AI teams using LLMs

Large language models (LLMs) are increasingly deployed in human-AI teams as support agents for complex tasks such as information retrieval, programming, and decision-making assistance. This paper studies the potential role of LLMs as defensive supervisors within mixed human-AI teams to detect malicious behavior. Using a dataset consisting of multi-party conversations and decisions over a 25-round horizon, we formulate the problem of malicious behavior detection from interaction traces. We find that LLMs are capable of identifying malicious behavior effectively, demonstrating their potential as defensive actors in collaborative environments.

Unknown

28 days ago

arXiv 2603.20976v1

Page 1 of 3