Autonomous Driving

56 papers in total

cs.CV, Autonomous Driving, CV

RoadBench: Benchmarking MLLMs on Fine-Grained Spatial Understanding and Reasoning under Urban Road Scenarios

RoadBench is a comprehensive benchmark designed to evaluate Multimodal Large Language Models (MLLMs) on fine-grained spatial understanding and reasoning tasks specifically in urban road scenarios. The benchmark covers diverse driving conditions including intersections, parking lots, and complex road layouts to assess models' capabilities in spatial perception, object localization, and scene comprehension. It provides standardized evaluation metrics and extensive testing scenarios to advance the development of AI systems for autonomous driving applications. The benchmark includes both perception-level and reasoning-level tasks to comprehensively measure the performance of multimodal models in understanding complex road environments.

RoadBench Team

20 days ago

Local paper

cs.CV, Autonomous Driving, CV

RoadBench: Benchmarking MLLMs on Fine-Grained Spatial Understanding and Reasoning under Urban Road Scenarios

This paper presents RoadBench, a comprehensive benchmark designed to evaluate Multimodal Large Language Models (MLLMs) on fine-grained spatial understanding and reasoning tasks in urban road scenarios. The benchmark includes diverse urban road images, detailed annotations, and challenging questions that require precise spatial perception and reasoning. RoadBench aims to fill a gap left by existing benchmarks, which do not focus on these capabilities under complex urban road conditions. Experimental results demonstrate that current MLLMs still face significant challenges on these tasks.

Unknown

20 days ago

Local paper

cs.CV, Autonomous Driving, End-to-End

Rectify, Don't Regret: Avoiding Pitfalls of Differentiable Simulation in Trajectory Prediction

This paper addresses critical challenges in autonomous driving trajectory prediction where minor initial deviations in open-loop models cascade into compounding errors, leading to out-of-distribution states. The authors identify a shortcut learning problem in differentiable closed-loop simulators where gradients inadvertently leak future ground truth information into previous predictions, causing non-causal regret instead of genuine recovery. To solve this, they propose a detached receding horizon rollout that severs computation graphs between simulation steps, forcing the model to learn authentic reactive recovery behaviors from drifted states. Comprehensive evaluations on the nuScenes and DeepScenario autonomous driving datasets demonstrate the effectiveness of their approach in achieving genuine trajectory rectification without temporal information leakage.

Anonymous Authors

26 days ago

arXiv 2603.23393v1
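
The detached rollout described in the abstract above can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation. The 1-D state, the proportional policy `step = theta * (target - x)`, and the function name `rollout_grad` are all assumptions made for illustration, and gradients are derived by hand rather than with an autodiff library. With `detach=True`, the derivative of the state with respect to the parameter is reset before every step, so each step's loss can only train the action taken from the current (possibly drifted) state and cannot flow back into earlier steps.

```python
def rollout_grad(theta, targets, x0, detach):
    """Return (loss, dloss/dtheta) for a toy 1-D closed-loop rollout.

    Hypothetical policy: step = theta * (target - x); next state = x + step.
    The loss is the summed squared error between each next state and its target.
    """
    x, dx = x0, 0.0  # state and its derivative with respect to theta
    loss, grad = 0.0, 0.0
    for g in targets:
        if detach:
            # Sever the graph: treat the current state as a constant input.
            dx = 0.0
        err = g - x
        x_next = x + theta * err
        # d/dtheta of x + theta * (g - x), with x depending on theta via dx.
        dx_next = dx * (1.0 - theta) + err
        loss += (x_next - g) ** 2
        grad += 2.0 * (x_next - g) * dx_next
        x, dx = x_next, dx_next
    return loss, grad


if __name__ == "__main__":
    for mode in (False, True):
        loss, grad = rollout_grad(0.5, [1.0, 2.0], 0.0, detach=mode)
        print(f"detach={mode}: loss={loss}, grad={grad}")
```

The forward pass is identical in both modes, so the loss does not change; only the gradient does. For `theta = 0.5`, targets `[1.0, 2.0]`, and `x0 = 0.0`, both modes give a loss of 0.8125, while the gradient is -4.0 with the graph intact and -3.25 when detached.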

cs.CV, Autonomous Driving, CV

YOLOv10 with Kolmogorov-Arnold networks and vision-language foundation models for interpretable object detection and trustworthy multimodal AI in computer vision perception

This paper presents an interpretable object detection framework that uses Kolmogorov-Arnold networks to enhance trustworthiness in autonomous vehicle perception systems. The approach addresses the limited transparency of confidence scores in visually degraded or ambiguous driving scenarios. A Kolmogorov-Arnold network serves as an interpretable post-hoc surrogate model for YOLOv10 detections, using seven geometric and semantic features to assess detection reliability. The additive spline-based architecture enables direct visualization of feature contributions, revealing when confidence scores are well supported and when they are unreliable. Experimental validation on the COCO dataset and University of Bath campus images demonstrates accurate trustworthiness estimation for autonomous driving perception.

Unknown

26 days ago

arXiv 2603.23037v1

cs.CV, Autonomous Driving, CV

Gau-Occ: Geometry-Completed Gaussians for Multi-Modal 3D Occupancy Prediction

This paper presents Gau-Occ, a multi-modal framework for 3D semantic occupancy prediction in autonomous driving that models scenes as compact collections of semantic 3D Gaussians, bypassing traditional dense volumetric processing. The proposed LiDAR Completion Diffuser recovers missing structures from sparse LiDAR point clouds to initialize robust Gaussian anchors, while Gaussian Anchor Fusion efficiently integrates multi-view image semantics through geometry-aligned 2D sampling and cross-modal alignment. By refining these compact Gaussian descriptors, the method achieves both spatial consistency and semantic discriminability for comprehensive 3D scene understanding in autonomous vehicles.

Chengxin Lv +3

26 days ago

arXiv 2603.22852v1

cs.CV, Autonomous Driving, CV

Feasibility of Augmented Reality-Guided Robotic Ultrasound with Cone-Beam CT Integration for Spine Procedures

This paper presents an optical see-through augmented reality (OST-AR)-guided robotic system for spine procedures, integrating cone-beam CT (CBCT)-derived 3D spine models with live ultrasound for enhanced needle trajectory planning. The system enables in situ visualization of spinal structures, combining global anatomical context from CBCT with local real-time ultrasound feedback. A phantom user study with 16 participants evaluated facet joint injection and lumbar puncture procedures under AR vs. conventional screen visualization conditions, demonstrating the feasibility of AR-guided medical robotics for precise needle placement.

Autonomous Driving

RoadBench: Benchmarking MLLMs on Fine-Grained Spatial Understanding and Reasoning under Urban Road Scenarios

RoadBench: Benchmarking MLLMs on Fine-Grained Spatial Understanding and Reasoning under Urban Road Scenarios

Rectify, Don't Regret: Avoiding Pitfalls of Differentiable Simulation in Trajectory Prediction

YOLOv10 with Kolmogorov-Arnold networks and vision-language foundation models for interpretable object detection and trustworthy multimodal AI in computer vision perception

Gau-Occ: Geometry-Completed Gaussians for Multi-Modal 3D Occupancy Prediction

Feasibility of Augmented Reality-Guided Robotic Ultrasound with Cone-Beam CT Integration for Spine Procedures

Closed-Loop Verbal Reinforcement Learning for Task-Level Robotic Planning

OpenEarth-Agent: From Tool Calling to Tool Creation for Open-Environment Earth Observation

Do World Action Models Generalize Better than VLAs? A Robustness Study

MineRobot: A Unified Framework for Kinematics Modeling and Solving of Underground Mining Robots in Virtual Environments

Future-Interactions-Aware Trajectory Prediction via Braid Theory

LRC-WeatherNet: LiDAR, RADAR, and Camera Fusion Network for Real-time Weather-type Classification in Autonomous Driving

The Golden Subspace: Where Efficiency Meets Generalization in Continual Test-Time Adaptation

Disengagement Analysis and Field Tests of a Prototypical Open-Source Level 4 Autonomous Driving System

IGV-RRT: Prior-Real-Time Observation Fusion for Active Object Search in Changing Environments

Optimal Solutions for the Moving Target Vehicle Routing Problem with Obstacles via Lazy Branch and Price

Directional Mollification for Controlled Smooth Path Generation

Multi-View Deformable Convolution Meets Visual Mamba for Coronary Artery Segmentation

Partial Attention in Deep Reinforcement Learning for Safe Multi-Agent Control

Memory-Efficient Boundary Map for Large-Scale Occupancy Grid Mapping