Autonomous Driving

56 papers in total

cs.RO | Autonomous Driving | End-to-End

Can a Robot Walk the Robotic Dog: Triple-Zero Collaborative Navigation for Heterogeneous Multi-Agent Systems

This paper presents Triple Zero Path Planning (TZPP), a collaborative navigation framework for heterogeneous multi-robot systems achieving zero training, zero prior knowledge, and zero simulation requirements. The system employs a coordinator-explorer architecture where a Unitree G1 humanoid robot performs task coordination while a Unitree Go2 quadruped robot explores and identifies feasible paths using guidance from a multimodal large language model. The framework is evaluated across diverse indoor and outdoor environments including obstacle-rich and landmark-sparse settings, demonstrating robust and human-comparable navigation efficiency. By eliminating reliance on traditional training and simulation pipelines, TZPP provides a practical approach for real-world deployment of heterogeneous robot cooperation with strong adaptability to unseen scenarios.

TZPP Authors

27 days ago

arXiv 2603.21723v1

cs.CL | Autonomous Driving | End-to-End

MIND: Multi-agent Inference for Negotiation Dialogue in Travel Planning

This paper introduces MIND (Multi-agent Inference for Negotiation Dialogue), a framework for simulating realistic consensus-building among travelers with heterogeneous preferences in travel planning scenarios. Grounded in Theory of Mind (ToM), MIND incorporates a Strategic Appraisal phase that achieves 90.2% accuracy in inferring opponent willingness from linguistic nuances. The framework demonstrates significant improvements over traditional Multi-Agent Debate (MAD) approaches, with a 20.5% improvement in High-w Hit and 30.7% increase in Debate Hit-Rate. Qualitative evaluations using LLM-as-a-Judge confirm superior performance in Rationality (68.8%) and Fluency (72.4%), achieving an overall win rate of 68.3%. This work effectively models human negotiation dynamics through advanced language understanding and multi-agent reasoning.

MIND Authors

27 days ago

arXiv 2603.21696v1

cs.CV | Autonomous Driving | CV

BiPreManip: Learning Affordance-Based Bimanual Preparatory Manipulation through Anticipatory Collaboration

This paper addresses bimanual robot manipulation tasks where coordinated actions between two robotic arms are required for complex object interactions. The work introduces a Collaborative Preparatory Manipulation framework that enables robots to perform sequential preparatory actions - such as pushing objects to accessible positions or lifting items - to facilitate subsequent goal-directed manipulations by the other arm. The proposed visual affordance-based approach first anticipates the final task objective and then generates appropriate preparatory manipulations, requiring deep understanding of object geometry, spatial relationships, and semantic properties. By learning from demonstrations and employing vision-based affordance recognition, the framework achieves effective bimanual coordination for tasks involving objects that are difficult to grasp directly.

BiPreManip Authors

27 days ago

arXiv 2603.21679v1

cs.CV | Autonomous Driving | End-to-End

Cross-Scenario Deraining Adaptation with Unpaired Data: Superpixel Structural Priors and Multi-Stage Pseudo-Rain Synthesis

Image deraining is a critical low-level computer vision task essential for robust outdoor surveillance and autonomous driving systems. While deep learning methods have shown success in aligned training settings, they typically experience significant performance degradation when applied to unseen Out-of-Distribution scenarios due to domain discrepancies between synthetic training data and real-world rain dynamics. This paper proposes a cross-scenario deraining adaptation framework that eliminates the need for paired rainy observations in target domains, utilizing only rain-free background images. The method incorporates a Superpixel Generation module that extracts stable structural priors from source domains using Simple Linear Iterative Clustering, enabling effective rain removal across diverse scenarios.

Anonymous

27 days ago

arXiv 2603.21661v1

cs.RO | Autonomous Driving | End-to-End

RTD-RAX: Fast, Safe Trajectory Planning for Systems under Unknown Disturbances

This paper presents RTD-RAX, a runtime-assurance extension of Reachability-based Trajectory Design that addresses two critical limitations of standard RTD implementations: conservatism from worst-case reachable-set overapproximations and inability to handle real-time disturbances. The framework combines a non-conservative RTD formulation for rapid goal-directed trajectory generation with mixed monotone reachability for fast disturbance-aware safety certification. When proposed trajectories fail safety verification under uncertainty, a repair procedure finds nearby safe trajectories that maintain progress toward the goal while guaranteeing safety under real-time disturbances. This approach enables provably safe, real-time trajectory planning for autonomous systems operating in uncertain environments.

S. K. S. Mitraz +3

27 days ago

arXiv 2603.21635v1
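
The verify-then-repair idea in the RTD-RAX abstract can be illustrated with a deliberately tiny toy, assuming a 1-D system whose position advances at a commanded speed `v` plus a bounded disturbance `|w| <= w_max`. The interval bound, the function names, and the "slow down until safe" repair rule are all assumptions made for illustration; they are not the mixed monotone reachability or repair procedure from the paper.

```python
from typing import Optional, Tuple


def reach_interval(x0: float, v: float, horizon: float, w_max: float) -> Tuple[float, float]:
    """Interval containing every position reachable at time `horizon` (toy 1-D model)."""
    return (x0 + (v - w_max) * horizon, x0 + (v + w_max) * horizon)


def is_safe(v: float, x0: float, horizon: float, w_max: float, obstacle_start: float) -> bool:
    """Safe if even the worst-case position stays strictly before the obstacle."""
    return reach_interval(x0, v, horizon, w_max)[1] < obstacle_start


def repair(v_nominal: float, x0: float, horizon: float, w_max: float,
           obstacle_start: float, step: float = 0.25, max_steps: int = 20) -> Optional[float]:
    """Search speeds at or below the nominal one; return the fastest that verifies."""
    for i in range(max_steps + 1):
        v = v_nominal - i * step
        if v < 0:
            break  # this toy does not consider reversing
        if is_safe(v, x0, horizon, w_max, obstacle_start):
            return v
    return None  # no certified trajectory found near the nominal plan
```

With `x0=0`, `horizon=2`, `w_max=0.5` and an obstacle starting at 5, a nominal speed of 2.5 fails verification, and the search returns the closest slower speed whose worst-case position still clears the obstacle.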

cs.CV | Autonomous Driving | CV

PGR-Net: Prior-Guided ROI Reasoning Network for Brain Tumor MRI Segmentation

Brain tumor MRI segmentation is essential for clinical diagnosis and treatment planning, enabling accurate lesion detection and radiotherapy target delineation. However, tumor lesions occupy only a small fraction of the volumetric space, resulting in severe spatial sparsity, while existing segmentation networks often overlook clinically observed spatial priors of tumor occurrence, leading to redundant feature computation over extensive background regions. To address this issue, we propose PGR-Net (Prior-Guided ROI Reasoning Network) - an explicit ROI-aware framework that incorporates a data-driven spatial prior set to capture the distribution and scale characteristics of tumor lesions, providing global guidance for more stable segmentation. Leveraging these priors, PGR-Net introduces a hierarchical Top-K ROI decision mechanism that progressively selects the most confident lesion candidate regions across encoder layers to improve localization precision.

PGR-Net Authors

27 days ago

arXiv 2603.21626v1
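
As a rough illustration of what a progressive Top-K selection across encoder layers might look like, the sketch below keeps only the most confident candidate regions at each layer and passes the survivors on to the next. The per-layer confidence scores, the shrinking `ks` schedule, and the function name are hypothetical; the abstract does not specify how PGR-Net computes or combines them.

```python
from typing import List, Sequence


def progressive_top_k(scores_per_layer: Sequence[Sequence[float]],
                      ks: Sequence[int]) -> List[int]:
    """Filter candidate regions layer by layer.

    scores_per_layer[l][r] is the (assumed) confidence of region r at layer l;
    ks[l] is how many regions survive after layer l.
    """
    if not scores_per_layer:
        return []
    survivors = list(range(len(scores_per_layer[0])))
    for scores, k in zip(scores_per_layer, ks):
        # Rank current survivors by this layer's confidence (ties: lower index first).
        survivors.sort(key=lambda r: (-scores[r], r))
        survivors = survivors[:k]
    return sorted(survivors)
```

A region that ranks highly at a deep layer can still be lost if it was pruned earlier, which is the usual trade-off of any coarse-to-fine filter of this kind.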

cs.CV | Autonomous Driving | End-to-End

Spatio-Temporal Attention Enhanced Multi-Agent DRL for UAV-Assisted Wireless Networks with Limited Communications

This paper addresses data transmission challenges in UAV-assisted wireless networks where multiple unmanned aerial vehicles serve as relays between ground users and a remote base station. The proposed delay-tolerant multi-agent deep reinforcement learning algorithm jointly optimizes trajectory planning, network formation, and transmission control while incorporating a delay-penalized reward mechanism to encourage inter-UAV information sharing. To handle information loss from unreliable channel conditions, a spatio-temporal attention mechanism predicts and recovers missing network state information, enhancing each UAV's situational awareness for improved collaboration and overall network throughput.

Anonymous Authors

27 days ago

arXiv 2603.21594v1

cs.CV | Autonomous Driving | End-to-End

HACMatch: Semi-Supervised Rotation Regression with Hardness-Aware Curriculum Pseudo Labeling

Regressing 3D rotations of objects from 2D images is a crucial yet challenging task, with broad applications in autonomous driving, virtual reality, and robotic control. This paper proposes a hardness-aware curriculum learning framework for semi-supervised rotation regression that dynamically selects pseudo-labeled samples based on their difficulty, progressing from easy to complex examples. The approach addresses limitations of rigid entropy-based pseudo-label filtering by introducing both multi-stage and adaptive curriculum strategies to effectively distinguish between reliable and unreliable unlabeled samples.

HACMatch Authors

27 days ago

arXiv 2603.21583v1
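
The easy-to-hard selection described in the HACMatch abstract can be sketched as follows, assuming each unlabeled sample already carries a scalar hardness score. The linear stage schedule and the name `curriculum_select` are illustrative assumptions; the paper's multi-stage and adaptive strategies may differ.

```python
import math
from typing import List, Sequence


def curriculum_select(hardness: Sequence[float], stage: int, num_stages: int) -> List[int]:
    """Return indices of pseudo-labeled samples admitted at a given stage.

    Stage 1 admits only the easiest samples; the final stage admits all of them.
    """
    if not 1 <= stage <= num_stages:
        raise ValueError("stage must be in 1..num_stages")
    keep = math.ceil(len(hardness) * stage / num_stages)
    # Sort sample indices from easiest to hardest (ties: lower index first).
    order = sorted(range(len(hardness)), key=lambda i: (hardness[i], i))
    return sorted(order[:keep])
```

Unlike a fixed confidence threshold, the admitted set here grows with training progress, so hard samples are deferred rather than discarded.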

cs.CV | Autonomous Driving | CV

Mind over Space: Can Multimodal Large Language Models Mentally Navigate?

This paper addresses the limitation of Multimodal Large Language Models (MLLMs) in embodied agents, which struggle with spatial reasoning across extensive spatiotemporal scales. The authors introduce Video2Mental, a benchmark that evaluates mental navigation capabilities by requiring models to construct hierarchical cognitive maps from long egocentric videos and generate landmark-based path plans. The research draws inspiration from cognitive science, exploring how biological intelligence uses mental navigation and spatial simulation prior to action. Benchmarking results demonstrate that mental navigation capabilities do not naturally emerge from standard pre-training, with frontier MLLMs showing significant challenges in zero-shot scenarios. Planning accuracy is verified through simulator-based physical interaction, providing a comprehensive evaluation framework for embodied AI systems.

Unknown

27 days ago

arXiv 2603.21577v1

physics.optics | Autonomous Driving

PRISM: Breaking the O(n) Memory Wall in Long-Context LLM Inference via O(1) Photonic Block Selection

Long-context LLM inference is bottlenecked not by compute but by the O(n) memory bandwidth cost of scanning the KV cache at every decode step, a wall that no amount of arithmetic scaling can break. Recent photonic accelerators have demonstrated impressive throughput for dense attention computation; however, these approaches inherit the same O(n) memory scaling as electronic attention when applied to long contexts. We observe that the real leverage point is the coarse block-selection step: a memory-bound similarity search that determines which KV blocks to fetch. We identify, for the first time, that this task is structurally matched to the photonic broadcast-and-weight paradigm: the query fans out to all candidates via passive splitting, signatures are quasi-static (matching electro-optic MRR programming), and only rank order matters (relaxing precision to 4-6 bits). Crucially, the photonic advantage grows with context length: as n increases, the electronic scan cost rises linearly while the photonic evaluation remains O(1). We instantiate this insight in PRISM (Photonic Ranking via Inner-product Similarity with Microring weights), a thin-film lithium niobate (TFLN) similarity engine. Hardware-impaired needle-in-a-haystack evaluation on Qwen2.5-7B confirms 100% accuracy from 4K through 64K tokens at k=32, with 16x traffic reduction at 64K context. PRISM achieves a four-order-of-magnitude energy advantage over GPU baselines at practical context lengths (n >= 4K).

Hyoseok Park +1

27 days ago

arXiv 2603.21576v1
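
The coarse block-selection step that the PRISM abstract singles out can be emulated in software as a low-precision top-k similarity search. In this sketch the per-block signatures, the uniform quantizer, and the function names are assumptions for illustration; it models only the selection logic, runs in O(n) on a CPU, and says nothing about the photonic hardware itself.

```python
from typing import List, Sequence


def quantize(vec: Sequence[float], bits: int = 4) -> List[int]:
    """Uniformly quantize a vector to signed integers (assumed scheme)."""
    levels = 2 ** (bits - 1) - 1               # e.g. 7 for 4 bits
    scale = max(abs(x) for x in vec) or 1.0    # avoid dividing by zero
    return [round(x / scale * levels) for x in vec]


def select_blocks(query: Sequence[float],
                  signatures: Sequence[Sequence[float]],
                  k: int,
                  bits: int = 4) -> List[int]:
    """Return indices of the k KV blocks whose signatures best match the query."""
    q = quantize(query, bits)
    scored = []
    for idx, sig in enumerate(signatures):
        s = quantize(sig, bits)
        scored.append((sum(a * b for a, b in zip(q, s)), idx))
    # Only the rank order is used, which is why coarse precision can suffice.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return sorted(idx for _, idx in scored[:k])


def traffic_reduction(num_blocks: int, k: int) -> float:
    """Blocks touched by a full scan divided by blocks fetched after selection."""
    return num_blocks / k
```

As a purely hypothetical configuration, a context split into 512 blocks with k=32 would give `traffic_reduction(512, 32) == 16.0`; the abstract does not state the block count behind its reported 16x figure.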

cs.NE | Autonomous Driving

Evolutionary Biparty Multiobjective UAV Path Planning: Problems and Empirical Comparisons

Unmanned aerial vehicles (UAVs) have been widely used in urban missions, and proper planning of UAV paths can improve mission efficiency while reducing the risk of potential third-party impact. Existing work has considered all efficiency and safety objectives for a single decision-maker (DM) and regarded this as a multiobjective optimization problem (MOP). However, there is usually not a single DM but two DMs, i.e., an efficiency DM and a safety DM, and the DMs are only concerned with their respective objectives. The final decision is made based on the solutions of both DMs. In this paper, for the first time, biparty multiobjective UAV path planning (BPMO-UAVPP) problems involving both efficiency and safety departments are modeled. The existing multiobjective immune algorithm with nondominated neighbor-based selection (NNIA), the hybrid evolutionary framework for the multiobjective immune algorithm (HEIA), and the adaptive immune-inspired multiobjective algorithm (AIMA) are modified for solving the BPMO-UAVPP problem, and then biparty multiobjective optimization algorithms, including the BPNNIA, BPHEIA, and BPAIMA, are proposed and comprehensively compared with traditional multiobjective evolutionary algorithms and typical multiparty multiobjective evolutionary algorithms (i.e., OptMPNDS and OptMPNDS2). The experimental results show that BPAIMA performs better than ordinary multiobjective evolutionary algorithms such as NSGA-II and multiparty multiobjective evolutionary algorithms such as OptMPNDS, OptMPNDS2, BPNNIA and BPHEIA.

Kesheng Chen +4

27 days ago

arXiv 2603.21544v1
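
To make the biparty setting in the BPMO-UAVPP abstract concrete, the sketch below evaluates each candidate path separately under the efficiency party's objectives and the safety party's objectives, then keeps the paths that are nondominated for both. Treating the intersection as the final answer is an assumption chosen for simplicity; the paper's actual optimality definition and algorithms (BPNNIA, BPHEIA, BPAIMA) are not reproduced here.

```python
from typing import List, Sequence

Objectives = Sequence[float]  # all objectives are minimized


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def nondominated(points: Sequence[Objectives]) -> List[int]:
    """Indices of points that no other point dominates."""
    return [i for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]


def common_nondominated(efficiency: Sequence[Objectives],
                        safety: Sequence[Objectives]) -> List[int]:
    """Candidate indices that are nondominated from both parties' viewpoints."""
    return sorted(set(nondominated(efficiency)) & set(nondominated(safety)))
```

The intersection can be empty when the two parties disagree strongly, which is one reason dedicated biparty algorithms are of interest.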

cs.RO | Autonomous Driving

SafePilot: A Framework for Assuring LLM-enabled Cyber-Physical Systems

Large Language Models (LLMs), deep learning architectures with typically over 10 billion parameters, have recently begun to be integrated into various cyber-physical systems (CPS) such as robotics, industrial automation, and autopilot systems. The abstract knowledge and reasoning capabilities of LLMs are employed for tasks like planning and navigation. However, a significant challenge arises from the tendency of LLMs to produce "hallucinations" - outputs that are coherent yet factually incorrect or contextually unsuitable. This characteristic can lead to undesirable or unsafe actions in the CPS. Therefore, our research focuses on assuring the LLM-enabled CPS by enhancing their critical properties. We propose SafePilot, a novel hierarchical neuro-symbolic framework that provides end-to-end assurance for LLM-enabled CPS according to attribute-based and temporal specifications. Given a task and its specification, SafePilot first invokes a hierarchical planner with a discriminator that assesses task complexity. If the task is deemed manageable, it is passed directly to an LLM-based task planner with built-in verification. Otherwise, the hierarchical planner applies a divide-and-conquer strategy, decomposing the task into sub-tasks, each of which is individually planned and later merged into a final solution. The LLM-based task planner translates natural language constraints into formal specifications and verifies the LLM's output against them. If violations are detected, it identifies the flaw, adjusts the prompt accordingly, and re-invokes the LLM. This iterative process continues until a valid plan is produced or a predefined limit is reached. Our framework supports LLM-enabled CPS with both attribute-based and temporal constraints. Its effectiveness and adaptability are demonstrated through two illustrative case studies.

Weizhe Xu +2

27 days ago

arXiv 2603.21523v1
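
The plan, verify, and re-prompt cycle described in the SafePilot abstract can be sketched as a bounded loop. Here `propose` stands in for the LLM call and `verify` for the formal checker; both, along with the toy task below, are hypothetical placeholders rather than SafePilot's real interfaces.

```python
from typing import Callable, List, Optional, Tuple

Plan = List[str]


def plan_with_verification(task: str,
                           propose: Callable[[str], Plan],
                           verify: Callable[[Plan], Optional[str]],
                           max_attempts: int = 3) -> Tuple[Optional[Plan], int]:
    """Request a plan, check it, and re-prompt with the violation until valid.

    `verify` returns None for a valid plan, otherwise a description of the flaw.
    Returns (plan or None, number of attempts used).
    """
    prompt = task
    for attempt in range(1, max_attempts + 1):
        plan = propose(prompt)
        violation = verify(plan)
        if violation is None:
            return plan, attempt
        # Feed the detected flaw back into the next prompt.
        prompt = f"{task}\nPrevious plan violated: {violation}"
    return None, max_attempts


# Toy stand-ins: the "LLM" only orders steps correctly after feedback.
def toy_propose(prompt: str) -> Plan:
    return ["grasp", "move"] if "violated" in prompt else ["move", "grasp"]


def toy_verify(plan: Plan) -> Optional[str]:
    ok = plan.index("grasp") < plan.index("move")
    return None if ok else "grasp must precede move"
```

Calling `plan_with_verification("deliver cup", toy_propose, toy_verify)` succeeds on the second attempt; with `max_attempts=1` it gives up and returns `None`, mirroring the predefined limit mentioned in the abstract.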

cs.RO | Autonomous Driving

GaussianSSC: Triplane-Guided Directional Gaussian Fields for 3D Semantic Completion

We present GaussianSSC, a two-stage, grid-native and triplane-guided approach to semantic scene completion (SSC) that injects the benefits of Gaussians without replacing the voxel grid or maintaining a separate Gaussian set. We introduce Gaussian Anchoring, a sub-pixel, Gaussian-weighted image aggregation over fused FPN features that tightens voxel-image alignment and improves monocular occupancy estimation. We further convert point-like voxel features into a learned per-voxel Gaussian field and refine triplane features via a triplane-aligned Gaussian-Triplane Refinement module that combines local gathering (target-centric) and global aggregation (source-centric). This directional, anisotropic support captures surface tangency, scale, and occlusion-aware asymmetry while preserving the efficiency of triplane representations. On SemanticKITTI (Behley et al., 2019), GaussianSSC improves Stage 1 occupancy by +1.0% Recall, +2.0% Precision, and +1.8% IoU over state-of-the-art baselines, and improves Stage 2 semantic prediction by +1.8% IoU and +0.8% mIoU.

Ruiqi Xian +4

27 days ago

arXiv 2603.21487v1
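
A sub-pixel, Gaussian-weighted aggregation of image features, in the spirit of the Gaussian Anchoring step named in the GaussianSSC abstract, might look like the sketch below. It assumes a single-channel feature map, an isotropic Gaussian with a fixed `sigma`, and a small square window; the paper's learned, anisotropic formulation is not reproduced.

```python
import math
from typing import Sequence


def gaussian_aggregate(feature: Sequence[Sequence[float]],
                       u: float, v: float,
                       sigma: float = 1.0, radius: int = 1) -> float:
    """Gaussian-weighted average of feature[y][x] around sub-pixel location (u, v).

    u is the horizontal (column) coordinate and v the vertical (row) coordinate.
    """
    height, width = len(feature), len(feature[0])
    cx, cy = math.floor(u + 0.5), math.floor(v + 0.5)  # nearest pixel centre
    num = den = 0.0
    for y in range(cy - radius, cy + radius + 1):
        for x in range(cx - radius, cx + radius + 1):
            if 0 <= y < height and 0 <= x < width:
                w = math.exp(-((x - u) ** 2 + (y - v) ** 2) / (2.0 * sigma ** 2))
                num += w * feature[y][x]
                den += w
    return num / den if den > 0 else 0.0
```

Because the weights are normalized, a constant feature map returns that constant at any sub-pixel location, which is a quick sanity check for the weighting.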

cs.RO | Autonomous Driving

HyReach: Vision-Guided Hybrid Manipulator Reaching in Unseen Cluttered Environments

As robotic systems increasingly operate in unstructured, cluttered, and previously unseen environments, there is a growing need for manipulators that combine compliance, adaptability, and precise control. This work presents a real-time hybrid rigid-soft continuum manipulator system designed for robust open-world object reaching in such challenging environments. The system integrates vision-based perception and 3D scene reconstruction with shape-aware motion planning to generate safe trajectories. A learning-based controller drives the hybrid arm to arbitrary target poses, leveraging the flexibility of the soft segment while maintaining the precision of the rigid segment. The system operates without environment-specific retraining, enabling direct generalization to new scenes. Extensive real-world experiments demonstrate consistent reaching performance with errors below 2 cm across diverse cluttered setups, highlighting the potential of hybrid manipulators for adaptive and reliable operation in unstructured environments.

Shivani Kamtikar +7

27 days ago

arXiv 2603.21421v1

cs.CV | Autonomous Driving

Bayesian Active Object Recognition and 6D Pose Estimation from Multimodal Contact Sensing

We present an active tactile exploration framework for joint object recognition and 6D pose estimation. The proposed method integrates wrist force/torque sensing, GelSight tactile sensing, and free-space constraints within a Bayesian inference framework that maintains a belief over object class and pose during active tactile exploration. By combining contact and non-contact evidence, the framework reduces ambiguity and improves robustness in the joint class-pose estimation problem. To enable efficient inference in the large hypothesis space, we employ a customized particle filter that progressively samples particles based on new observations. The inferred belief is further used to guide active exploration by selecting informative next touches under reachability constraints. For effective data collection, a motion planning and control framework is developed to plan and execute feasible paths for tactile exploration and to handle unexpected contacts and GelSight-surface alignment through tactile servoing. We evaluate the framework in simulation and on a Franka Panda robot using 11 YCB objects. Results show that incorporating tactile and free-space information substantially improves recognition and pose estimation accuracy and stability, while reducing the number of action cycles compared with force/torque-only baselines. Code, dataset, and supplementary material will be made available online.

Haodong Zheng +5

28 days ago

arXiv 2603.21410v1
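
The belief update at the heart of the tactile exploration entry above can be illustrated with a discrete Bayes rule over object classes. This is a simplification under stated assumptions: the paper maintains a particle-filter belief over class and 6D pose, whereas the sketch tracks class only, and the likelihood values are made up for the example.

```python
from typing import Dict


def bayes_update(prior: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Posterior over hypotheses given a prior and per-hypothesis likelihoods."""
    unnormalized = {h: p * likelihood.get(h, 0.0) for h, p in prior.items()}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation is impossible under every hypothesis")
    return {h: v / total for h, v in unnormalized.items()}


# Hypothetical example: one contact observation, then one free-space observation.
prior = {"mug": 1 / 3, "can": 1 / 3, "box": 1 / 3}
after_touch = bayes_update(prior, {"mug": 0.6, "can": 0.3, "box": 0.1})
# The gripper swept through space a mug would have occupied, so "mug" is ruled out.
after_free_space = bayes_update(after_touch, {"mug": 0.0, "can": 1.0, "box": 1.0})
```

The second update shows how non-contact evidence can eliminate a hypothesis that contact evidence alone had favoured.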

cs.CL | Autonomous Driving | End-to-End

A transformer architecture alteration to incentivise externalised reasoning

We propose a novel architectural modification and post-training pipeline for enhancing large language model reasoning capabilities by teaching models to truncate forward passes early. Our approach augments the standard transformer architecture with an early-exit mechanism at intermediate layers, enabling the model to exit at shallower layers when tokens can be predicted without deep computation. Through a calibration stage followed by reinforcement learning, we incentivize the model to exit as early as possible while preserving task performance. Preliminary experiments on small reasoning models demonstrate adaptive computation reduction across tokens, suggesting that at appropriate scale, this approach can minimize excess computation for non-myopic planning using internal activations, reserving deep computation only for difficult-to-predict tokens.

Anonymous

28 days ago

arXiv 2603.21376v1
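
One common way to realize an early-exit mechanism like the one in the entry above is to attach a prediction head to each intermediate layer and stop at the first layer whose top-token probability clears a threshold. The sketch below assumes that rule and takes precomputed per-layer logits as input; the paper's calibration stage and reinforcement learning objective are not modeled.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit(per_layer_logits: Sequence[Sequence[float]],
               threshold: float) -> Tuple[int, int]:
    """Return (exit layer index, predicted token id).

    Exits at the first layer whose top probability reaches `threshold`,
    falling back to the final layer if none does.
    """
    last = len(per_layer_logits) - 1
    for layer, logits in enumerate(per_layer_logits):
        probs = softmax(logits)
        top = max(range(len(probs)), key=probs.__getitem__)
        if probs[top] >= threshold or layer == last:
            return layer, top
    raise ValueError("per_layer_logits must not be empty")
```

Raising the threshold trades compute for confidence: easy tokens still leave early, while ambiguous ones run through more layers.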

cs.CV | Autonomous Driving

Privacy-Preserving Federated Action Recognition via Differentially Private Selective Tuning and Efficient Communication

Federated video action recognition enables collaborative model training without sharing raw video data, yet remains vulnerable to two key challenges: model exposure and communication overhead. Gradients exchanged between clients and the server can leak private motion patterns, while full-model synchronization of high-dimensional video networks causes significant bandwidth and communication costs. To address these issues, we propose Federated Differential Privacy with Selective Tuning and Efficient Communication for Action Recognition, namely FedDP-STECAR. Our FedDP-STECAR framework selectively fine-tunes and perturbs only a small subset of task-relevant layers under Differential Privacy (DP), reducing the surface of information leakage while preserving temporal coherence in video features. By transmitting only the tuned layers during aggregation, communication traffic is reduced by over 99% compared to full-model updates. Experiments on the UCF-101 dataset using the MViT-B-16x4 transformer show that FedDP-STECAR achieves up to 70.2% higher accuracy under strict privacy (ε = 0.65) in centralized settings and 48% faster training with 73.1% accuracy in federated setups, enabling scalable and privacy-preserving video action recognition. Code available at https://github.com/izakariyya/mvit-federated-videodp

Idris Zakariyya +5

28 days ago

arXiv 2603.21305v1
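
The two ideas combined in the FedDP-STECAR abstract, perturbing only selected layers and transmitting only those layers, can be sketched as below. Per-layer norm clipping with Gaussian noise is a simplification chosen for brevity, the layer names are hypothetical, and the sketch does not compute or guarantee any particular ε.

```python
import math
import random
from typing import Dict, List, Set


def clip(vec: List[float], max_norm: float) -> List[float]:
    """Scale a vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in vec]


def privatize_selected(update: Dict[str, List[float]], selected: Set[str],
                       max_norm: float, noise_multiplier: float,
                       rng: random.Random) -> Dict[str, List[float]]:
    """Clip and add Gaussian noise to the selected layers; drop all others."""
    sigma = noise_multiplier * max_norm
    out = {}
    for name in sorted(selected):
        clipped = clip(update[name], max_norm)
        out[name] = [x + rng.gauss(0.0, sigma) for x in clipped]
    return out


def traffic_saving(update: Dict[str, List[float]], selected: Set[str]) -> float:
    """Fraction of parameters that no longer needs to be transmitted."""
    total = sum(len(v) for v in update.values())
    sent = sum(len(update[name]) for name in selected)
    return 1.0 - sent / total
```

For a toy update with a 99-parameter `backbone` and a 2-parameter `head`, sending only `head` saves about 98% of the traffic; the savings reported in the abstract depend on the actual layer sizes of the model used.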

cs.CV | Autonomous Driving | End-to-End

Enhancing Brain Tumor Classification Using Vision Transformers with Colormap-Based Feature Representation on BRISC2025 Dataset

This study presents a deep learning framework for multi-class brain tumor classification from magnetic resonance imaging scans using Vision Transformers enhanced with colormap-based feature representation. The proposed method leverages transformer architectures to capture long-range dependencies while incorporating color mapping techniques to emphasize structural and intensity variations in MRI scans. Experiments on the BRISC2025 dataset containing glioma, meningioma, pituitary tumor, and non-tumor cases demonstrate superior performance with 98.90% accuracy, significantly outperforming baseline approaches. The framework utilizes standard metrics including accuracy, precision, recall, F1-score, and AUC for comprehensive evaluation.

Anonymous Authors

28 days ago

arXiv 2603.21234v1

cs.CV | Autonomous Driving

Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species

Visually cataloging and quantifying the natural world requires pushing the boundaries of both detailed visual classification and counting at scale. Despite significant progress, particularly in crowd and traffic analysis, fine-grained, taxonomy-aware plant counting remains underexplored in vision. In contrast to crowds, plants exhibit nonrigid morphologies and variations in physical appearance across growth stages and environments. To fill this gap, we present TPC-268, the first plant counting benchmark incorporating plant taxonomy. Our dataset couples instance-level point annotations with Linnaean labels (kingdom -> species) and organ categories, enabling hierarchical reasoning and species-aware evaluation. The dataset features 10,000 images with 678,050 point annotations, includes 268 countable plant categories over 242 plant species in Plantae and Fungi, and spans observation scales from canopy-level remote sensing imagery to tissue-level microscopy. We follow the problem setting of class-agnostic counting (CAC), provide taxonomy-consistent, scale-aware data splits, and benchmark state-of-the-art regression- and detection-based CAC approaches. By capturing the biodiversity, hierarchical structure, and multi-scale nature of botanical and mycological taxa, TPC-268 provides a biologically grounded testbed to advance fine-grained class-agnostic counting. Dataset and code are available at https://github.com/tiny-smart/TPC-268.

Jinyu Xu +6

28 days ago

arXiv 2603.21229v1

cs.RO | Autonomous Driving | CV

Architecture for Multi-Unmanned Aerial Vehicles based Autonomous Precision Agriculture Systems

This paper presents an architectural framework for deploying multiple coordinated UAVs in precision agriculture applications. The system addresses key challenges including autonomous mission planning, coordinated data acquisition, and efficient image processing for field analysis. Various computational tasks such as path planning, communication protocols, and field mapping are integrated to enable minimal human intervention during agricultural operations. The architecture provides a comprehensive end-to-end solution supporting cooperative multi-UAV deployment, optimized for precision agriculture monitoring and intervention tasks.

Unknown

28 days ago

arXiv 2603.21183v1