Large Language Models

53 papers in total

cs.AI · Large Language Models · End-to-End

Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe

This paper presents a systematic empirical study of scaling reinforcement learning for large language model agents in complex, multi-turn environments. Using TravelPlanner as a testbed, the authors decompose the agentic RL design space along five axes: reward shaping, model scaling, data composition, algorithm selection, and environmental stability. Their controlled experiments yield three main findings. First, reward and algorithm choices are scale-dependent: smaller models benefit from staged rewards, while larger models converge with simpler dense rewards. Second, roughly 1K training samples with a balanced difficulty mixture is the optimal training budget. Third, environmental stability is crucial for preventing policy degradation. The work provides a practical recipe for developing autonomous LLM agents capable of long-horizon tool orchestration and planning.

TravelPlanner Research Team

27 days ago

arXiv 2603.21972v1

cs.CL · Large Language Models

Parameter-Efficient Fine-Tuning for Medical Text Summarization: A Comparative Study of Lora, Prompt Tuning, and Full Fine-Tuning

Fine-tuning large language models for domain-specific tasks such as medical text summarization demands substantial computational resources. Parameter-efficient fine-tuning (PEFT) methods offer promising alternatives by updating only a small fraction of parameters. This paper compares three adaptation approaches, namely Low-Rank Adaptation (LoRA), Prompt Tuning, and Full Fine-Tuning, across the Flan-T5 model family on the PubMed medical summarization dataset. Through experiments with multiple random seeds, we demonstrate that LoRA consistently outperforms full fine-tuning, achieving 43.52 +/- 0.18 ROUGE-1 on Flan-T5-Large with only 0.6% trainable parameters compared to 40.67 +/- 0.21 for full fine-tuning. Sensitivity analyses examine the impact of LoRA rank and prompt token count. Our findings suggest the low-rank constraint provides beneficial regularization, challenging assumptions about the necessity of full parameter updates. Code is available at https://github.com/eracoding/llm-medical-summarization

Ulugbek Shernazarov +2

27 days ago

arXiv 2603.21970v1
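
To make the "small fraction of parameters" idea in the LoRA entry above concrete, the sketch below counts trainable parameters when a single frozen weight matrix receives a rank-r update. The function name, layer sizes, and rank are illustrative assumptions, not values from the paper; the reported 0.6% depends on which Flan-T5 layers are adapted, which the abstract does not state.

```python
def lora_trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Share of parameters that are trainable for one LoRA-adapted linear layer.

    The frozen weight W has d_out * d_in entries. LoRA adds two trainable
    factors, B (d_out x rank) and A (rank x d_in), so the effective weight
    is W + B @ A while only A and B receive gradient updates.
    """
    frozen = d_in * d_out
    trainable = rank * (d_in + d_out)
    return trainable / (frozen + trainable)


# Hypothetical layer size and ranks, chosen only for illustration.
for rank in (4, 8, 16):
    share = lora_trainable_fraction(1024, 1024, rank)
    print(f"rank={rank:2d}  trainable share={share:.2%}")
```

The share grows roughly linearly with the rank, which is the quantity a rank-sensitivity analysis would vary.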

cs.CL · Large Language Models · End-to-End

SLURP-TN: Resource for Tunisian Dialect Spoken Language Understanding

This paper introduces SLURP-TN, a comprehensive spoken language understanding resource specifically designed for the Tunisian dialect. The dataset comprises 4165 sentences recorded from 55 native speakers, totaling approximately 5 hours of acoustic material. By translating sentences from six SLURP domains, the authors address the critical gap in SLU resources for low-resource languages. The research develops baseline Automatic Speech Recognition and SLU models that leverage deep neural networks and pre-trained language models to extract semantic information from speech utterances in task-oriented dialogue systems. This work enables the Tunisian-speaking population to benefit from recent advances in natural language processing and speech recognition technology.

Elyadata Research Team

27 days ago

arXiv 2603.21940v1

cs.CV · Large Language Models · CV

Guideline-grounded retrieval-augmented generation for ophthalmic clinical decision support

This paper presents Oph-Guid-RAG, a multimodal visual retrieval-augmented generation system designed for ophthalmology clinical decision support. The system treats guideline pages as independent evidence units and retrieves page images while preserving visual elements like tables and flowcharts. It implements a controllable retrieval framework with routing and filtering mechanisms to reduce noise while selectively introducing external evidence. The system combines query decomposition, rewriting, retrieval, reranking, and multimodal reasoning to provide traceable outputs with guideline references. Evaluated on HealthBench with doctor-based scoring, the approach significantly outperforms GPT-5.2 and GPT-5.4 on hard subsets, achieving +30.0% improvement in overall score and +10.4% to +24.4% gains in accuracy.

Oph-Guid-RAG Team

27 days ago

arXiv 2603.21925v1

cs.CL · Large Language Models · End-to-End

P^2O: Joint Policy and Prompt Optimization

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, vanilla RLVR suffers from inefficient exploration, particularly when confronting hard samples that yield near-zero success rates. In such scenarios, the reliance on sparse outcome rewards typically results in zero-advantage estimates, effectively starving the model of supervision signals despite the high informational value of these instances. To address this, we propose P^2O, a novel framework that synergizes Prompt Optimization with Policy Optimization. P^2O identifies hard samples during training iterations and leverages the GeneticPareto (GEPA) prompt optimization algorithm to evolve prompt templates that guide the model toward discovering successful trajectories.

P^2O Authors

27 days ago

arXiv 2603.21877v1
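
The "zero-advantage" problem described in the P^2O entry above can be illustrated with a small sketch. It assumes a GRPO-style group-normalized advantage (reward minus group mean, divided by group standard deviation); the abstract does not specify the estimator, so both the formula and the function name are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group.

    This is an assumed GRPO-style estimator, not the paper's exact one.
    eps avoids division by zero when every reward in the group is equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A hard prompt: every rollout fails, so the binary outcome reward is 0.
hard_adv = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# An easier prompt: half the rollouts succeed.
mixed_adv = group_relative_advantages([0.0, 1.0, 0.0, 1.0])

print(hard_adv)   # every advantage is exactly zero: no learning signal
print(mixed_adv)  # failures are negative, successes are positive
```

When every rollout for a prompt fails, each advantage is zero and the prompt contributes no gradient, which is the situation that evolving the prompt template is meant to escape.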

cs.CV · Large Language Models

Manifold-Aware Exploration for Reinforcement Learning in Video Generation

Group Relative Policy Optimization (GRPO) methods for video generation like FlowGRPO remain far less reliable than their counterparts for language models and images. This gap arises because video generation has a complex solution space, and the ODE-to-SDE conversion used for exploration can inject excess noise, lowering rollout quality and making reward estimates less reliable, which destabilizes post-training alignment. To address this problem, we view the pre-trained model as defining a valid video data manifold and formulate the core problem as constraining exploration within the vicinity of this manifold, ensuring that rollout quality is preserved and reward estimates remain reliable. We propose SAGE-GRPO (Stable Alignment via Exploration), which applies constraints at both micro and macro levels. At the micro level, we derive a precise manifold-aware SDE with a logarithmic curvature correction and introduce a gradient norm equalizer to stabilize sampling and updates across timesteps. At the macro level, we use a dual trust region with a periodic moving anchor and stepwise constraints so that the trust region tracks checkpoints that are closer to the manifold and limits long-horizon drift. We evaluate SAGE-GRPO on HunyuanVideo1.5 using the original VideoAlign as the reward model and observe consistent gains over previous methods in VQ, MQ, TA, and visual metrics (CLIPScore, PickScore), demonstrating superior performance in both reward maximization and overall video quality. The code and visual gallery are available at https://dungeonmassster.github.io/SAGE-GRPO-Page/.

Mingzhe Zheng +11

27 days ago

arXiv 2603.21872v1

cs.CL · Large Language Models · Transformer

Holistic Scaling Laws for Optimal Mixture-of-Experts Architecture Optimization

Scaling laws for Large Language Models govern macroscopic resource allocation, yet translating them into precise Mixture-of-Experts (MoE) architectural configurations remains an open problem due to the combinatorially vast design space. This paper proposes a reusable framework for holistic MoE architectural optimization that addresses limitations in existing approaches. The authors demonstrate that FLOPs per token alone is insufficient for fair comparison of MoE models because varying computational densities across layer types can inflate parameters without proportional compute cost. They establish a joint constraint triad of FLOPs per token, active parameters, and total parameters, then reduce the 16-dimensional architectural search space to two sequential low-dimensional phases through algebraic constraints.

Anonymous

27 days ago

arXiv 2603.21862v1
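
The MoE entry above argues that FLOPs per token alone does not pin down a model's size. The sketch below illustrates that with a simplified budget for one MoE feed-forward layer. It assumes a two-matrix FFN per expert without biases, a linear router, and the common rule of thumb of about 2 FLOPs per active parameter per token; none of these details come from the paper, and the function name and configurations are hypothetical.

```python
def moe_ffn_budget(d_model: int, d_ff: int, n_experts: int, top_k: int) -> dict:
    """Rough parameter and compute budget for one MoE feed-forward layer."""
    per_expert = 2 * d_model * d_ff        # up- and down-projection weights
    router = d_model * n_experts           # one routing logit per expert
    total = n_experts * per_expert + router
    active = top_k * per_expert + router   # only top_k experts run per token
    return {
        "total_params": total,
        "active_params": active,
        "flops_per_token": 2 * active,     # assumed: 1 multiply + 1 add per weight
    }


# Two hypothetical configurations that differ only in the number of experts.
small = moe_ffn_budget(d_model=1024, d_ff=4096, n_experts=8, top_k=2)
large = moe_ffn_budget(d_model=1024, d_ff=4096, n_experts=64, top_k=2)

for name, cfg in (("8 experts", small), ("64 experts", large)):
    print(name, cfg)
```

Under these assumptions the two layers have nearly the same compute per token while total parameters differ by roughly a factor of eight, which is why a comparison would need to constrain FLOPs per token, active parameters, and total parameters together.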

cs.CL · Large Language Models · Transformer

Reasoning or Rhetoric? An Empirical Analysis of Moral Reasoning Explanations in Large Language Models

This paper investigates whether large language models demonstrate genuine moral reasoning capabilities or merely produce superficially convincing reasoning-like outputs. The study analyzes responses from 13 different LLMs across six classical moral dilemmas using Kohlberg's stages of moral development as an evaluation framework. Through an LLM-as-judge scoring pipeline validated across three judge models, over 600 responses were classified and analyzed. The central finding is that LLM responses predominantly exhibit post-conventional reasoning patterns (Stages 5-6), which contradicts typical human developmental trajectories, where such reasoning emerges later in moral development. This inversion suggests that alignment training may produce outputs that mimic advanced moral reasoning without the underlying developmental progression characteristic of human moral cognition.

Researchers

27 days ago

arXiv 2603.21854v1

cs.CL · Large Language Models

Riding Brainwaves in LLM Space: Understanding Activation Patterns Using Individual Neural Signatures

Consumer-grade EEG is entering everyday devices, from earbuds to headbands, raising the question of whether language models can be adapted to individual neural responses. We test this by asking whether frozen LLM representations encode person-specific EEG signals, directions in activation space that predict one person's brain activity but not another's. Using word-level EEG from 30 participants reading naturalistic sentences (ZuCo corpus), we train a separate linear probe for each person, mapping hidden states from a frozen Qwen 2.5 7B to that individual's EEG power. Person-specific probes outperform a single population probe on every EEG feature tested; for high-gamma power, the person-specific probe achieves rho = 0.183, a ninefold improvement over the population probe (rho = 0.020, p < 10^-4). A negative control, fixation count, shows no person-specific advantage (p = 0.360); fixation count reflects word length and frequency rather than individual cognition. The individual directions are temporally stable (split-half cosine = 0.824), non-transferable across people (self rho = 0.369 vs. other rho = 0.143, p < 10^-19), and distinct from the shared population signal: person-specific probes retain predictive power after the population component is removed. The person-specific signal concentrates in the model's deep layers, rising consistently with depth and peaking at Layer 24 of 28. The results are consistent across architectures (LLaMA 3.1 8B) and survive word-level confound controls. Frozen language models contain stable, person-specific neural directions in their deep layers, providing a geometric foundation for EEG-driven personalization.

Ajan Subramanian +2

27 days ago

arXiv 2603.21847v1

cs.CV · Large Language Models · Autonomous Driving

KLDrive: Fine-Grained 3D Scene Reasoning for Autonomous Driving based on Knowledge Graph

This paper presents KLDrive, a novel knowledge-graph-augmented large language model reasoning framework specifically designed for fine-grained question answering in autonomous driving scenarios. The framework addresses critical challenges in autonomous driving perception by consolidating multi-source evidence through an energy-based scene fact construction module that builds reliable scene knowledge graphs. A specialized LLM agent performs fact-grounded reasoning over constrained action spaces using explicit structural constraints, combining structured prompting with few-shot in-context exemplars to adapt to diverse driving reasoning tasks. This approach tackles issues of unreliable scene facts, hallucinations, and opaque reasoning found in existing perception pipelines and driving-oriented LLM methods.

KLDrive Authors

28 days ago

arXiv 2603.21029v1

cs.CL · Large Language Models · End-to-End

Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO

Large language models often exhibit selection bias in multiple-choice and pairwise evaluation tasks due to non-semantic factors such as option positions and label symbols. Existing inference-time debiasing methods are computationally expensive and may harm reasoning capabilities, while pointwise training approaches ignore the importance of consistent answers across different question permutations. This paper proposes Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which addresses selection bias by enforcing permutation-consistent semantic reasoning through two complementary mechanisms: cross-permutation advantage computation relative to the mean reward over all permutations, and consistency-aware reward that encourages stable decision-making across different permutations. The proposed approach effectively mitigates selection bias while preserving the model's reasoning capabilities.

PA-GRPO Authors

28 days ago

arXiv 2603.21016v1
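
The two mechanisms named in the PA-GRPO entry above can be sketched in a few lines. This is a loose illustration under stated assumptions: the baseline is the plain mean reward over all rollouts from all permutations of one question, and the consistency term is the share of permutations that agree with the most common answer content. The function names, the exact reward form, and the sample data are hypothetical and are not taken from the paper.

```python
from collections import Counter
from statistics import mean


def cross_permutation_advantages(rewards_by_perm):
    """Subtract one shared baseline: the mean reward over every permutation."""
    all_rewards = [r for group in rewards_by_perm for r in group]
    baseline = mean(all_rewards)
    return [[r - baseline for r in group] for group in rewards_by_perm]


def consistency_score(answers_by_perm):
    """Share of permutations whose chosen option content matches the majority.

    Answers are compared by content, not by position or label, so a model
    that always picks option "A" regardless of content scores low.
    """
    counts = Counter(answers_by_perm)
    return counts.most_common(1)[0][1] / len(answers_by_perm)


# Three orderings of the same question, two rollouts each (made-up rewards).
rewards = [[1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]
adv = cross_permutation_advantages(rewards)

# Content of the option the model selected under each ordering (made up).
score = consistency_score(["Paris", "Paris", "Lyon"])

print(adv)
print(score)
```

Because the baseline is shared, rollouts under an ordering where the model does worse receive negative advantages even if they are average within that ordering, which pushes the policy toward answers that hold up across orderings.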

cs.CL · Large Language Models · End-to-End

RubricRAG: Towards Interpretable and Reliable LLM Evaluation via Domain Knowledge Retrieval for Rubric Generation

Large language models are increasingly evaluated using automated graders that output scalar scores or preferences, but these approaches lack interpretability as a single score cannot explain why an answer is good or bad. Rubric-based evaluation offers a more transparent alternative by decomposing quality into explicit, checkable criteria. However, manually designing high-quality, query-specific rubrics is labor-intensive and cognitively demanding. This work investigates whether LLMs can generate interpretable and effective rubrics through domain knowledge retrieval for automated evaluation. The proposed RubricRAG framework aims to enhance the interpretability and reliability of LLM evaluation by leveraging retrieved domain knowledge to generate task-specific evaluation rubrics.

RubricRAG Authors

29 days ago

arXiv 2603.20882v1

cs.CV · Large Language Models

Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs

Multimodal Large Language Models (MLLMs) have demonstrated remarkable potential in medical image analysis. However, their application in gastrointestinal endoscopy is currently hindered by two critical limitations: the misalignment between general model reasoning and standardized clinical cognitive pathways, and the lack of causal association between visual features and diagnostic outcomes. In this paper, we propose a novel Clinical-Cognitive-Aligned (CogAlign) framework to address these challenges. First, we endow the model with rigorous clinical analytical capabilities by constructing the hierarchical clinical cognition dataset and employing Supervised Fine-Tuning (SFT). Unlike conventional approaches, this strategy internalizes the hierarchical diagnostic logic of experts, ranging from anatomical localization and morphological evaluation to microvascular analysis, directly into the model. Second, to eliminate visual bias, we provide a theoretical analysis demonstrating that standard supervised tuning inevitably converges to spurious background correlations. Guided by this insight, we propose a counterfactual-driven reinforcement learning strategy to enforce causal rectification. By generating counterfactual normal samples via lesion masking and optimizing through clinical-cognition-centric rewards, we constrain the model to strictly ground its diagnosis in causal lesion features. Extensive experiments demonstrate that our approach achieves State-of-the-Art (SoTA) performance across multiple benchmarks, significantly enhancing diagnostic accuracy in complex clinical scenarios. All source code and datasets will be made publicly available.

Huan Zheng +8

29 days ago

arXiv 2603.20698v1
