Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models
Vision-Language-Action (VLA) models integrate perception, language understanding, and motor control into unified architectures for robotic control. This study systematically investigates how VLAs transform multimodal inputs into actions, applying activation injection, sparse autoencoders, and linear probes to six models (80M–7B parameters) across more than 394,000 rollout episodes. The visual pathway emerges as the dominant driver of action generation in every architecture: injecting baseline activations into null-prompt episodes reproduces near-identical behavior. Cross-task injection experiments further show that the models encode spatially bound motor programs tied to scene coordinates rather than abstract task representations. Language sensitivity depends on task structure rather than model design: when visual context uniquely specifies the task, language is ignored, but when multiple goals share a scene, language becomes essential (94% accuracy for X-VLA on libero_goal). These findings offer mechanistic insight into how VLAs operate and carry implications for embodied AI systems that require precise vision-language coordination for robotic manipulation.
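To make the activation-injection setup concrete, the following is a minimal PyTorch sketch: activations at a chosen layer are cached during a baseline (full-prompt) forward pass, then patched into a null-prompt pass via forward hooks. The `ToyVLA` module, its layer names, and the feature dimensions are hypothetical stand-ins for illustration, not the actual models or hook points used in the study.

```python
import torch
import torch.nn as nn

class ToyVLA(nn.Module):
    """Illustrative stand-in for a VLA policy: fused vision+language features -> action."""
    def __init__(self, d=64, action_dim=7):
        super().__init__()
        self.vision_encoder = nn.Linear(d, d)
        self.language_encoder = nn.Linear(d, d)
        self.fusion = nn.Linear(2 * d, d)       # injection target (assumed hook point)
        self.action_head = nn.Linear(d, action_dim)

    def forward(self, img_feat, lang_feat):
        v = torch.relu(self.vision_encoder(img_feat))
        l = torch.relu(self.language_encoder(lang_feat))
        h = torch.relu(self.fusion(torch.cat([v, l], dim=-1)))
        return self.action_head(h)

model = ToyVLA().eval()
cache = {}

img = torch.randn(1, 64)        # stands in for visual features from an observation
lang_full = torch.randn(1, 64)  # stands in for the real instruction embedding
lang_null = torch.zeros(1, 64)  # stands in for an empty (null) prompt

# 1) Capture: record the fusion-layer activation during the baseline pass.
def cache_hook(module, inputs, output):
    cache["fusion"] = output.detach().clone()

handle = model.fusion.register_forward_hook(cache_hook)
baseline_action = model(img, lang_full)
handle.remove()

# 2) Inject: overwrite the fusion-layer activation during the null-prompt pass.
#    Returning a tensor from a forward hook replaces the module's output.
def inject_hook(module, inputs, output):
    return cache["fusion"]

handle = model.fusion.register_forward_hook(inject_hook)
injected_action = model(img, lang_null)
handle.remove()

# If the injected activations carry the behaviorally relevant signal, the
# null-prompt action should match the baseline action.
print(torch.allclose(baseline_action, injected_action))  # True in this toy case
```

In a real rollout this capture-and-inject cycle would run at every control step, and the hook point (vision tokens, language tokens, or fused features) determines which pathway's contribution is being tested.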