Off-Policy Value-Based Reinforcement Learning for Large Language Models
This paper addresses the critical challenge of improving data efficiency in reinforcement learning for large language models. The authors propose ReVal, a novel Bellman-update-based method that enables off-policy learning by combining stepwise consistency signals with trajectory-level outcome verification. Because it supports replay-buffer-based training, ReVal can efficiently reuse past trajectories, significantly improving sample efficiency over on-policy approaches. Experiments on mathematical reasoning benchmarks, in particular with DeepSeek-R1-Distill-1.5B, show that ReVal converges faster and reaches higher final performance, including a 2.7% improvement on AIME24.
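To make the training loop concrete, below is a minimal sketch of the two ingredients the abstract names: a replay buffer of past trajectories and a Bellman-style update in which intermediate steps bootstrap from the next step's value (the stepwise consistency signal) while the final step is anchored to a verified trajectory-level outcome. All names here (`ReplayBuffer`, `bellman_targets`, `td_loss`, the linear value head, the feature shapes) are illustrative assumptions, not the paper's actual API or architecture.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class ReplayBuffer:
    """Fixed-capacity store of past trajectories for off-policy reuse."""
    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, trajectory):
        self.buffer.append(trajectory)

    def sample(self, batch_size: int):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def bellman_targets(values: torch.Tensor, outcome_reward: float,
                    gamma: float = 1.0) -> torch.Tensor:
    """One-step Bellman targets over a trajectory of per-step values.

    Intermediate steps regress toward gamma * V(s_{t+1}) (stepwise
    consistency); the final step regresses toward the verifier's
    trajectory-level outcome reward.
    """
    bootstrap = gamma * values[1:].detach()      # V(s_{t+1}), no grad through target
    terminal = torch.tensor([outcome_reward])    # verified outcome at trajectory end
    return torch.cat([bootstrap, terminal])

def td_loss(values: torch.Tensor, outcome_reward: float) -> torch.Tensor:
    """Mean squared Bellman residual over the trajectory."""
    return ((values - bellman_targets(values, outcome_reward)) ** 2).mean()

# Usage sketch: trajectories are (per-step features, verifier outcome) pairs.
buffer = ReplayBuffer()
value_head = nn.Linear(16, 1)                    # stand-in for a value head on LLM states
opt = torch.optim.Adam(value_head.parameters(), lr=1e-4)

# A stale (off-policy) rollout: 8 reasoning steps with dim-16 features,
# and an outcome reward of 1.0 meaning the final answer was verified correct.
states = torch.randn(8, 16)
buffer.add((states, 1.0))

for states, reward in buffer.sample(batch_size=4):
    values = value_head(states).squeeze(-1)      # V(s_t) for each step
    loss = td_loss(values, reward)
    opt.zero_grad(); loss.backward(); opt.step()
```

Because the targets depend only on stored trajectories and a verifier, not on fresh samples from the current policy, this kind of update can keep consuming old rollouts from the buffer, which is where the claimed sample-efficiency gain over on-policy training would come from.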