Embodied AI

58 papers in total

cs.CL | End-to-End | Transformer

Do LLM-Driven Agents Exhibit Engagement Mechanisms? Controlled Tests of Information Load, Descriptive Norms, and Popularity Cues

Large language models enable increasingly expressive agent-based simulations, but pose methodological challenges regarding behavioral validity. This paper evaluates LLM-driven simulation credibility through a social media test case examining information engagement. Using a Weibo-like environment, the study systematically manipulates information load and descriptive norms while allowing popularity cues to evolve endogenously. The research tests whether simulated user behavior responds systematically to theoretical constructs rather than producing merely plausible outputs. Findings indicate that engagement responds systematically to information load and descriptive norms, with sensitivity to popularity cues varying across contexts. The paper discusses methodological implications for simulation-based communication research, particularly for multi-condition experimental designs involving LLM-driven agents.

Anonymous Authors

about 1 month ago

arXiv 2603.20911v1

cs.RO | Embodied AI

Enhancing Vision-Based Policies with Omni-View and Cross-Modality Knowledge Distillation for Mobile Robots

Vision-based policies are widely applied in robotics for tasks such as manipulation and locomotion. On lightweight mobile robots, however, they face a trilemma of limited scene transferability, restricted onboard computation resources, and sensor hardware cost. To address these issues, we propose a knowledge distillation approach that transfers knowledge from an information-rich, appearance-invariant omni-view depth policy to a lightweight monocular policy. The key idea is to train the student not only to mimic the expert actions but also to align with the latent embeddings of the omni-view depth teacher. Experiments demonstrate that omni-view and depth inputs improve scene transfer and navigation performance, and that the proposed distillation method enhances the performance of a single-view monocular policy compared with policies that solely imitate actions. Real-world experiments further validate the effectiveness and practicality of our approach. Code will be released publicly.

about 1 month ago

arXiv 2603.20679v1
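
The abstract above describes a student trained both to mimic expert actions and to align with the teacher's latent embeddings. A minimal sketch of such a combined objective follows. The use of mean squared error for both terms, the `latent_weight` parameter, and all function names are assumptions made for illustration; the abstract does not specify the loss form.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_action, teacher_action,
                      student_latent, teacher_latent,
                      latent_weight=0.5):
    """Action-imitation term plus a weighted latent-alignment term.

    latent_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    action_term = mse(student_action, teacher_action)
    latent_term = mse(student_latent, teacher_latent)
    return action_term + latent_weight * latent_term


# Identical actions, but the student's latent is off in one dimension.
loss = distillation_loss([1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 0.0])
```

Setting `latent_weight=0.0` reduces this to pure action imitation, which is the baseline the abstract compares against.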

cs.RO | Embodied AI

StageCraft: Execution Aware Mitigation of Distractor and Obstruction Failures in VLA Models

Large-scale pretraining on text and image data, along with diverse robot demonstrations, has helped Vision-Language-Action models (VLAs) generalize to novel tasks, objects, and scenes. However, these models are still susceptible to failure in the presence of execution-time impediments such as distractors and physical obstructions in the robot's workspace. Existing policy improvement methods fine-tune base VLAs to improve generalization, yet they still struggle in unseen distractor settings. To address this problem, we investigate whether internet-scale pretraining of large vision-language models (VLMs) can be leveraged to reason about these impediments and mitigate policy failures. To this end, we propose StageCraft, a training-free approach to improve pretrained VLA policy performance by manipulating the environment's initial state using VLM-based in-context reasoning. StageCraft takes policy rollout videos and success labels as input and leverages the VLM's reasoning ability to infer which objects in the initial state need to be manipulated to avoid anticipated execution failures. StageCraft is an extensible plug-and-play module that does not introduce additional constraints on the underlying policy and requires only a few policy rollouts to work. We evaluate the performance of state-of-the-art VLA models with StageCraft and show an absolute 40% performance improvement across three real-world task domains involving diverse distractors and obstructions. Our simulation experiments in RLBench empirically show that StageCraft tailors its extent of intervention to the strength of the underlying policy and improves its performance with more in-context samples. Videos of StageCraft in action can be found at https://stagecraft-decorator.github.io/stagecraft/ .

Kartikay Milind Pangaonkar +3

about 1 month ago

arXiv 2603.20659v1

cs.CV | End-to-End | CV

Speedup Patch: Learning a Plug-and-Play Policy to Accelerate Embodied Manipulation

This paper presents Speedup Patch (SuP), a lightweight, policy-agnostic framework designed to accelerate embodied manipulation by adaptively downsampling action chunks from existing policies. The method formulates the scheduler optimization as a Constrained Markov Decision Process to maximize efficiency while maintaining task performance. To address offline safety constraints, the approach introduces world-model-based state deviation as a surrogate metric for success evaluation. SuP demonstrates that embodied manipulation tasks can be significantly accelerated without requiring policy retraining or costly online interactions.

about 1 month ago

arXiv 2603.20658v1
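
To make "downsampling action chunks" concrete, here is a minimal sketch of the fixed-stride case. It assumes each action in a chunk is an absolute target (so skipped steps can be dropped safely) and that the final action is always kept so the chunk still ends at the same target. The function name and these assumptions are illustrative; the abstract does not describe how SuP's learned scheduler chooses the rate.

```python
def downsample_chunk(chunk, stride):
    """Keep every `stride`-th action of a chunk, always retaining the last one.

    chunk  : list of actions (assumed to be absolute targets)
    stride : positive integer; 1 returns the chunk unchanged
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    kept = list(chunk[stride - 1::stride])
    # The slice ends on the final action only when the length is a
    # multiple of the stride; otherwise append the final action explicitly.
    if chunk and len(chunk) % stride != 0:
        kept.append(chunk[-1])
    return kept


chunk = list(range(8))           # stand-in for eight consecutive actions
fast = downsample_chunk(chunk, 2)
faster = downsample_chunk(chunk, 3)
```

An adaptive scheduler in this picture would pick `stride` per chunk, using a larger value where the motion tolerates it and a smaller one near contact-rich phases.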

cs.CV | Embodied AI

When Negation Is a Geometry Problem in Vision-Language Models

Joint Vision-Language Embedding models such as CLIP typically fail at understanding negation in text queries - for example, failing to distinguish "no" in the query: "a plain blue shirt with no logos". Prior work has largely addressed this limitation through data-centric approaches, fine-tuning CLIP on large-scale synthetic negation datasets. However, these efforts are commonly evaluated using retrieval-based metrics that cannot reliably reflect whether negation is actually understood. In this paper, we identify two key limitations of such evaluation metrics and investigate an alternative evaluation framework based on Multimodal LLMs-as-a-judge, which typically excel at understanding simple yes/no questions about image content, providing a fair evaluation of negation understanding in CLIP models. We then ask whether there already exists a direction in the CLIP embedding space associated with negation. We find evidence that such a direction exists, and show that it can be manipulated through test-time intervention via representation engineering to steer CLIP toward negation-aware behavior without any fine-tuning. Finally, we test negation understanding on non-common image-text samples to evaluate generalization under distribution shifts.

Fawaz Sammani +3

about 1 month ago

arXiv 2603.20554v1

cs.CV | CV | Embodied AI

Memory Over Maps: 3D Object Localization Without Reconstruction

This paper addresses the fundamental question of whether complete 3D scene reconstruction is necessary for object localization in embodied tasks. The authors propose a map-free pipeline that stores only posed RGB-D keyframes as a lightweight visual memory, eliminating the need for global 3D representations. At query time, the method retrieves candidate views and re-ranks them using a vision-language model for semantic reasoning. A sparse on-demand 3D estimate of the target is constructed through depth backprojection, enabling efficient localization without expensive reconstruction. This approach significantly reduces mapping time, storage overhead, and scalability limitations while maintaining effective performance for navigation and manipulation tasks.

about 1 month ago

arXiv 2603.20530v1
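
The depth backprojection step mentioned in the abstract above can be sketched with the standard pinhole camera model. This is a generic illustration, not the authors' pipeline: the intrinsics (`fx`, `fy`, `cx`, `cy`), the camera-to-world pose convention, and the function names are assumptions.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(point, rotation, translation):
    """Apply a camera-to-world pose: world = R @ point + t.

    rotation    : 3x3 nested list (assumed camera-to-world)
    translation : length-3 sequence
    """
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_cam = backproject(420, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
p_world = camera_to_world(p_cam, identity, (1.0, 0.0, 0.0))
```

Applying this only to the pixels of a retrieved target in a few posed keyframes yields a sparse, on-demand 3D estimate without building a global map.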

cs.CL | Transformer | Embodied AI

Profiling learners' affective engagement: Emotion AI, intercultural pragmatics, and language learning

This paper examines the emotional dimensions of second language acquisition, highlighting how learners experience frustration and triumph throughout their language learning journey. The research explores the application of Emotion AI technologies to monitor and support learner affective states during language practice. It investigates the role of AI chatbots as tools for developing conversational abilities and intercultural pragmatic competence. The study addresses both the benefits of AI-powered language partners, such as responsiveness and non-judgmental interaction, and their limitations including emotional voidness and cultural biases. This work contributes to the growing intersection of artificial intelligence and language education, demonstrating how language models can be leveraged to create more supportive and culturally-aware learning environments.

about 1 month ago

arXiv 2603.20479v1

cs.CL | Embodied AI

Coding Agents are Effective Long-Context Processors

Large Language Models (LLMs) have demonstrated remarkable progress in scaling to access massive contexts. However, this access is via latent and uninterpretable attention mechanisms, and LLMs fail to effectively process long contexts, exhibiting significant performance degradation as context length increases. In this work, we study whether long-context processing can be externalized from latent attention into explicit, executable interactions, by allowing coding agents to organize text in file systems and manipulate it using their native tools. We evaluate off-the-shelf frontier coding agents as the general interface for tasks that require processing long contexts, including long-context reasoning, retrieval-augmented generation, and open-domain question answering over large-scale corpora containing up to three trillion tokens. Across multiple benchmarks, these agents outperform published state-of-the-art by 17.3% on average. We attribute this efficacy to two key factors: native tool proficiency, which enables agents to leverage executable code and terminal commands rather than passive semantic queries, and file system familiarity, which allows them to navigate massive text corpora as directory structures. These findings suggest that delegating long-context processing to coding agents offers an effective alternative to semantic search or context window scaling, opening new directions for long-context processing in LLMs.

about 1 month ago

arXiv 2603.20432v1

cs.AI | CV | Transformer

CAMA: Exploring Collusive Adversarial Attacks in Cooperative Multi-Agent Reinforcement Learning

Cooperative multi-agent reinforcement learning (c-MARL) has been widely deployed in real-world applications, including social robots, embodied intelligence, and UAV swarms. However, various adversarial attacks continue to threaten c-MARL systems. Existing studies primarily focus on single-adversary perturbation attacks and white-box adversarial attacks that manipulate agents' internal observations or actions. This paper proposes a novel study of collusive adversarial attacks by strategically organizing malicious agents into three collusive attack modes: Collective Malicious Agents, Disguised Malicious Agents, and Spied Malicious Agents. The proposed unified framework CAMA enables policy-level collusive attacks, with attack effectiveness theoretically analyzed from perspectives of disruptiveness and stealthiness.

CAMA Research Team

about 1 month ago

arXiv 2603.20390v1

cs.CL | Transformer | Embodied AI

The Production of Meaning in the Processing of Natural Language

Understanding the fundamental mechanisms governing the production of meaning in natural language processing is critical for designing safe and engaging human-agent interactions. Research in cognitive science and social psychology has demonstrated that human semantic processing exhibits contextuality more consistent with quantum logical mechanisms than classical Boolean theories. Recent studies have found similar quantum-like behavioral signatures in large language models, including clear violations of the Bell inequality during interpretation of ambiguous expressions. This work explores the CHSH parameter across the inference parameter space of language models spanning four orders of magnitude in scale, cross-referencing findings with MMLU benchmarks, hallucination rates, and nonsense detection metrics to understand semantic contextuality in neural language systems.

Anonymous Authors

about 1 month ago

arXiv 2603.20381v1
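
For readers unfamiliar with the CHSH parameter named in the abstract above: given correlations E between two parties' ±1 outcomes under settings (a, a') and (b, b'), one common sign convention is S = E(a,b) + E(a,b') + E(a',b) - E(a',b'). Classical (local hidden variable) models satisfy |S| <= 2, while quantum mechanics allows up to 2*sqrt(2). The sketch below only computes S from outcome data; how the paper maps language-model interpretations onto settings and outcomes is not described in the abstract, so the function names and the toy data are assumptions.

```python
import math


def correlation(outcome_pairs):
    """Average product of paired outcomes, each outcome being +1 or -1."""
    return sum(x * y for x, y in outcome_pairs) / len(outcome_pairs)


def chsh(e_ab, e_ab_prime, e_a_prime_b, e_a_prime_b_prime):
    """CHSH parameter S under the convention with the minus sign on E(a', b')."""
    return e_ab + e_ab_prime + e_a_prime_b - e_a_prime_b_prime


def violates_classical_bound(s):
    """True when |S| exceeds the classical limit of 2."""
    return abs(s) > 2.0


# Toy correlation from four hypothetical paired outcomes.
e_toy = correlation([(1, 1), (-1, -1), (1, -1), (1, 1)])

# Correlations at the quantum optimum: three at +sqrt(2)/2, one at -sqrt(2)/2.
c = math.sqrt(2) / 2
s_quantum = chsh(c, c, c, -c)
s_classical = chsh(1.0, 1.0, 1.0, 1.0)
```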

cs.RO | End-to-End | Transformer

AGILE: A Comprehensive Workflow for Humanoid Loco-Manipulation Learning

This paper presents AGILE, an end-to-end workflow designed to address the challenges of transferring reinforcement learning policies from simulation to real humanoid robots. The framework standardizes the policy-development lifecycle through four key stages: interactive environment verification, reproducible training, unified evaluation, and descriptor-driven deployment. By mitigating common sim-to-real failure modes, AGILE enables systematic development of loco-manipulation skills for humanoid robots. The approach includes scenario-based tests and randomized rollouts under motion-quality diagnostics for automated regression testing and robustness assessment.

about 1 month ago

arXiv 2603.20147v1

cs.RO | CV | Object Detection

KUKAloha: A General, Low-Cost, and Shared-Control based Teleoperation Framework for Construction Robot Arm

This paper presents KUKAloha, a teleoperation framework for construction robot arms combining shared-control with autonomous perception. The system uses a leader-follower paradigm where a lightweight guiding arm enables intuitive human control for coarse motion, while an AprilTag-based perception module handles precise alignment and grasping. By separating human guidance from fine manipulation, the framework enhances safety and repeatability when operating large construction manipulators. Experiments on a KUKA robot arm demonstrate reduced operator workload and improved task efficiency for scalable demonstration collection in construction environments.

about 1 month ago

arXiv 2603.20129v1

cs.CR | Embodied AI

Trojan's Whisper: Stealthy Manipulation of OpenClaw through Injected Bootstrapped Guidance

Autonomous coding agents are increasingly integrated into software development workflows, offering capabilities that extend beyond code suggestion to active system interaction and environment management. OpenClaw, a representative platform in this emerging paradigm, introduces an extensible skill ecosystem that allows third-party developers to inject behavioral guidance through lifecycle hooks during agent initialization. While this design enhances automation and customization, it also opens a novel and unexplored attack surface. In this paper, we identify and systematically characterize guidance injection, a stealthy attack vector that embeds adversarial operational narratives into bootstrap guidance files. Unlike traditional prompt injection, which relies on explicit malicious instructions, guidance injection manipulates the agent's reasoning context by framing harmful actions as routine best practices. These narratives are automatically incorporated into the agent's interpretive framework and influence future task execution without raising suspicion. We construct 26 malicious skills spanning 13 attack categories including credential exfiltration, workspace destruction, privilege escalation, and persistent backdoor installation. We evaluate them using ORE-Bench, a realistic developer workspace benchmark we developed. Across 52 natural user prompts and six state-of-the-art LLM backends, our attacks achieve success rates from 16.0% to 64.2%, with the majority of malicious actions executed autonomously without user confirmation. Furthermore, 94% of our malicious skills evade detection by existing static and LLM-based scanners. Our findings reveal fundamental tensions in the design of autonomous agent ecosystems and underscore the urgent need for defenses based on capability isolation, runtime policy enforcement, and transparent guidance provenance.

about 1 month ago

arXiv 2603.19974v1

cs.CV | End-to-End | CV

Morphology-Consistent Humanoid Interaction through Robot-Centric Video Synthesis

This paper presents Dream2Act, a robot-centric framework enabling zero-shot interaction through generative video synthesis for humanoid robots. The approach addresses the morphology gap in traditional human-to-robot motion retargeting by synthesizing robot-native motion directly. Given a third-person image of the robot and target object, video generation models envision the robot completing tasks with morphology-consistent motion. A high-fidelity pose extraction system recovers physically feasible joint trajectories from synthesized videos, which are subsequently executed via a general-purpose whole-body controller.

about 1 month ago

arXiv 2603.19709v1

cs.CV | CV | Transformer

Morphology-Consistent Humanoid Interaction through Robot-Centric Video Synthesis

This paper presents Dream2Act, a robot-centric framework enabling zero-shot interaction for humanoid robots through generative video synthesis. By taking a third-person image of the robot and target object, the framework leverages video generation models to synthesize morphology-consistent motion for task completion. The approach employs a high-fidelity pose extraction system to recover physically feasible, robot-native joint trajectories from the synthesized videos. These trajectories are subsequently executed via a general-purpose whole-body controller, eliminating the need for extensive policy training or explicit motion retargeting that suffers from morphology gaps.

about 1 month ago

arXiv 2603.19709v2

cs.CV | CV | 3D Detection

WorldAgents: Can Foundation Image Models be Agents for 3D World Models?

This paper investigates whether 2D foundation image models inherently possess 3D world model capabilities by evaluating their performance on 3D world synthesis tasks. The authors propose a multi-agent architecture consisting of a VLM-based director, an image synthesizer, and a two-step verifier that evaluates outputs from both 2D image and 3D reconstruction spaces. Through systematic benchmarking of state-of-the-art image generation models and Vision-Language Models, they demonstrate that their agentic approach achieves coherent and robust 3D reconstruction, enabling exploration through novel view rendering. The research provides insights into leveraging implicit 3D knowledge from 2D foundation models for world-level scene understanding and generation.

WorldAgents Team

about 1 month ago

arXiv 2603.19708v1

cs.LG | Embodied AI

Probing the Latent World: Emergent Discrete Symbols and Physical Structure in Latent Representations

Video world models trained with Joint Embedding Predictive Architectures (JEPA) acquire rich spatiotemporal representations by predicting masked regions in latent space rather than reconstructing pixels. This removes the visual verification pathway of generative models, creating a structural interpretability gap: the encoder has learned physical structure inaccessible in any inspectable form. Existing probing methods either operate in continuous space without a structured intermediate layer, or attach generative components whose parameters confound attribution of behavior to the encoder. We propose the AI Mother Tongue (AIM) framework as a passive quantization probe: a lightweight, vocabulary-free probe that converts V-JEPA 2 continuous latent vectors into discrete symbol sequences without task-specific supervision or modifying the encoder. Because the encoder is kept completely frozen, any symbolic structure in the AIM codebook is attributable entirely to V-JEPA 2 pre-trained representations -- not to the probe. We evaluate through category-contrast experiments on Kinetics-mini along three physical dimensions: grasp angle, object geometry, and motion temporal structure. AIM symbol distributions differ significantly across all three experiments (chi^2 p < 10^{-4}; MI 0.036--0.117 bits, NMI 1.2--3.9% of the 3-bit maximum; JSD up to 0.342; codebook active ratio 62.5%). The experiments reveal that V-JEPA 2 latent space is markedly compact: diverse action categories share a common representational core, with semantic differences encoded as graded distributional variations rather than categorical boundaries. These results establish Stage 1 of a four-stage roadmap toward an action-conditioned symbolic world model, demonstrating that structured symbolic manifolds are discoverable properties of frozen JEPA latent spaces.

about 1 month ago

arXiv 2603.20327v1
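
The statistics quoted in the abstract above (mutual information in bits, its value as a fraction of a 3-bit maximum, and Jensen-Shannon divergence) can be computed from a category-by-symbol count table as sketched below. This is a generic illustration, not the authors' evaluation code. Assumptions: logarithms are base 2, the normalization is simply MI divided by 3 bits (which matches the reported 0.036 to 0.117 bits mapping to 1.2 to 3.9%), and the toy count tables are invented for the example.

```python
import math


def mutual_information_bits(joint_counts):
    """Mutual information (bits) between row and column variables of a count table."""
    total = sum(sum(row) for row in joint_counts)
    row_sums = [sum(row) for row in joint_counts]
    col_sums = [sum(col) for col in zip(*joint_counts)]
    mi = 0.0
    for i, row in enumerate(joint_counts):
        for j, n in enumerate(row):
            if n > 0:
                # p(i,j) * log2( p(i,j) / (p(i) * p(j)) )
                mi += (n / total) * math.log2(n * total / (row_sums[i] * col_sums[j]))
    return mi


def js_divergence_bits(p, q):
    """Jensen-Shannon divergence (bits) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def fraction_of_max(mi_bits, max_bits=3.0):
    """MI as a fraction of an assumed maximum (3 bits for an 8-symbol codebook)."""
    return mi_bits / max_bits


mi_dependent = mutual_information_bits([[10, 0], [0, 10]])   # perfectly informative
mi_independent = mutual_information_bits([[5, 5], [5, 5]])   # no information
jsd_disjoint = js_divergence_bits([1.0, 0.0], [0.0, 1.0])
low, high = fraction_of_max(0.036), fraction_of_max(0.117)
```

A 3-bit maximum corresponds to an 8-entry codebook, in which case the reported 62.5% active ratio would be consistent with 5 of 8 entries in use; the codebook size itself is an inference from the abstract, not a stated figure.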

cs.CV | CV | Transformer

In-the-Wild Camouflage Attack on Vehicle Detectors through Controllable Image Editing

Deep neural networks have achieved remarkable success in computer vision but remain highly vulnerable to adversarial attacks. This paper proposes a new framework that formulates vehicle camouflage attacks as a conditional image-editing problem, exploring both image-level and scene-level camouflage generation strategies. The method fine-tunes a ControlNet to synthesize camouflaged vehicles directly on real images while enforcing vehicle structural fidelity, style consistency, and adversarial effectiveness through a unified objective. Experiments on COCO and LINZ datasets demonstrate that the approach achieves significantly stronger attack effectiveness with more than 38% AP50 decrease, while better preserving vehicle structure and improving human-perceived stealthiness compared to existing methods.

Anonymous Authors

about 1 month ago

arXiv 2603.19456v1

cs.RO | 3D Detection | Embodied AI

A Closed-Form CLF-CBF Controller for Whole-Body Continuum Soft Robot Collision Avoidance

Safe operation is crucial for deploying robots in human-centered 3D environments. Soft continuum manipulators offer passive safety through mechanical compliance but require active control for reliable collision avoidance. This paper presents a closed-form Control Lyapunov Function and Control Barrier Function controller for real-time 3D obstacle avoidance in soft continuum manipulators without online optimization. The method analytically embeds safety constraints into control inputs, ensuring stability and safety while avoiding feasibility issues of optimization-based approaches.

about 1 month ago

arXiv 2603.19424v1
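
To illustrate what "closed-form" means in the abstract above, the sketch below shows the textbook projection of a nominal control onto a single Control Barrier Function constraint, which needs no online optimizer. It is not the paper's CLF-CBF controller: the single-integrator dynamics (x_dot = u), the circular obstacle, the barrier h(x) = ||x - c||^2 - r^2, the gain `alpha`, and all names are assumptions chosen to keep the example small.

```python
def cbf_filter(u_nom, a, b):
    """Closest control to u_nom satisfying the half-space constraint a . u + b >= 0.

    For a CBF h with dynamics x_dot = f(x) + g(x) u, a = Lg h and b = Lf h + alpha * h.
    """
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) + b
    if slack >= 0.0:
        return list(u_nom)            # nominal control is already safe
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        return list(u_nom)            # constraint does not depend on u; nothing to correct
    scale = slack / norm_sq
    return [ui - scale * ai for ui, ai in zip(u_nom, a)]


def obstacle_constraint(x, center, radius, alpha):
    """Constraint terms (a, b) for h(x) = ||x - c||^2 - r^2 under x_dot = u."""
    diff = [xi - ci for xi, ci in zip(x, center)]
    h = sum(d * d for d in diff) - radius ** 2
    a = [2.0 * d for d in diff]       # Lg h for a single integrator
    b = alpha * h                     # Lf h is zero because f(x) = 0
    return a, b


a, b = obstacle_constraint(x=[2.0, 0.0], center=[0.0, 0.0], radius=1.0, alpha=1.0)
u_safe = cbf_filter([-1.0, 0.0], a, b)      # nominal control heads toward the obstacle
u_unchanged = cbf_filter([1.0, 0.0], a, b)  # nominal control heads away
```

With one affine constraint the minimizer is an explicit projection, so feasibility never depends on a solver converging, which is the kind of property the abstract contrasts with optimization-based approaches.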

cs.CL | End-to-End | Transformer

The Autonomy Tax: Defense Training Breaks LLM Agents

Large language model agents increasingly rely on external tools such as file operations, API calls, and database transactions to autonomously complete complex multi-step tasks. Practitioners deploy defense-trained models to protect against prompt injection attacks that manipulate agent behavior through malicious observations. This paper reveals a fundamental capability-alignment paradox where defense training designed to improve safety systematically destroys agent competence while failing to prevent sophisticated attacks. Evaluating defended models against undefended baselines across 97 agent tasks and 1,000 adversarial prompts, we uncover three systematic biases unique to multi-step agents including agent incompetence bias manifesting as immediate tool execution breakdown and cascade amplification bias causing early failures to propagate through retry loops. These findings demonstrate that current defense training approaches create significant trade-offs between safety and functional capability in autonomous LLM agents.

Anonymous Authors

about 1 month ago

arXiv 2603.19423v1
