大语言模型

共 53 篇论文

cs.LG大语言模型端到端

Revisiting Quantum Code Generation: Where Should Domain Knowledge Live?

This paper investigates how to effectively incorporate domain knowledge into LLM-based code generation systems for quantum software development. The researchers evaluate various strategies including parameter-specialized fine-tuned models and general-purpose LLMs enhanced with retrieval-augmented generation and agent-based inference mechanisms. Using the Qiskit-HumanEval benchmark, they compare different approaches to quantum code generation with Qiskit frameworks. The study finds that modern general-purpose LLMs with advanced inference techniques consistently outperform specialized fine-tuned baselines, achieving approximately 47% pass@1 performance. These findings suggest that general-purpose models with retrieval and execution feedback mechanisms may be more suitable for evolving software ecosystems compared to domain-specific specialized models.

Anonymous Authors

arXiv 2603.22184v1

cs.AI大语言模型

MARCUS: An agentic, multimodal vision-language model for cardiac diagnosis and management

Cardiovascular disease remains the leading cause of global mortality, with progress hindered by human interpretation of complex cardiac tests. Current AI vision-language models are limited to single-modality inputs and are non-interactive. We present MARCUS (Multimodal Autonomous Reasoning and Chat for Ultrasound and Signals), an agentic vision-language system for end-to-end interpretation of electrocardiograms (ECGs), echocardiograms, and cardiac magnetic resonance imaging (CMR) independently and as multimodal input. MARCUS employs a hierarchical agentic architecture comprising modality-specific vision-language expert models, each integrating domain-trained visual encoders with multi-stage language model optimization, coordinated by a multimodal orchestrator. Trained on 13.5 million images (0.25M ECGs, 1.3M echocardiogram images, 12M CMR images) and our novel expert-curated dataset spanning 1.6 million questions, MARCUS achieves state-of-the-art performance surpassing frontier models (GPT-5 Thinking, Gemini 2.5 Pro Deep Think). Across internal (Stanford) and external (UCSF) test cohorts, MARCUS achieves accuracies of 87-91% for ECG, 67-86% for echocardiography, and 85-88% for CMR, outperforming frontier models by 34-45% (P<0.001). On multimodal cases, MARCUS achieved 70% accuracy, nearly triple that of frontier models (22-28%), with 1.7-3.0x higher free-text quality scores. Our agentic architecture also confers resistance to mirage reasoning, whereby vision-language models derive reasoning from unintended textual signals or hallucinated visual content. MARCUS demonstrates that domain-specific visual encoders with an agentic orchestrator enable multimodal cardiac interpretation. We release our models, code, and benchmark open-source.

Jack W O'Sullivan +10

arXiv 2603.22179v1

cs.CL大语言模型Transformer

Causal Evidence that Language Models use Confidence to Drive Behavior

Metacognition, the ability to assess one's own cognitive performance, is a fundamental capability documented across species where internal confidence estimates guide adaptive behavior. This research investigates whether Large Language Models (LLMs) actively utilize confidence signals to regulate their behavior through a four-phase abstention paradigm. The study first establishes internal confidence estimates without abstention options, then reveals that LLMs apply implicit thresholds to these estimates when deciding whether to answer or abstain. Findings demonstrate that confidence serves as the dominant predictor of behavior, with effect sizes an order of magnitude larger than knowledge retrieval accessibility or semantic features. Causal evidence is provided through activation steering experiments, where manipulating internal confidence signals correspondingly shifts abstention rates, demonstrating a direct causal relationship between confidence estimation and behavioral regulation.

arXiv 2603.22161v1

cs.LG大语言模型

Multimodal Survival Analysis with Locally Deployable Large Language Models

We study multimodal survival analysis integrating clinical text, tabular covariates, and genomic profiles using locally deployable large language models (LLMs). As many institutions face tight computational and privacy constraints, this setting motivates the use of lightweight, on-premises models. Our approach jointly estimates calibrated survival probabilities and generates concise, evidence-grounded prognosis text via teacher-student distillation and principled multimodal fusion. On a TCGA cohort, it outperforms standard baselines, avoids reliance on cloud services and associated privacy concerns, and reduces the risk of hallucinated or miscalibrated estimates that can be observed in base LLMs.

Moritz Gögl +1

arXiv 2603.22158v1

cs.CV大语言模型CV

dynActivation: A Trainable Activation Family for Adaptive Nonlinearity

This paper introduces dynActivation, a per-layer trainable activation function that dynamically interpolates between base nonlinearities and linear paths using lightweight learned scalars. The proposed method is evaluated across vision tasks (CIFAR-10, MNIST) and language modeling tasks, demonstrating significant improvements in training efficiency (up to 54% faster) and performance. On CIFAR-10, dynActivation(Mish) achieves up to 14.02% improvement over static Mish, with 24% reduction in convergence time. In deep network scaling experiments (up to 75 layers), dynActivation maintains robust performance (95.3-99.3% accuracy) while ReLU collapses below 80%, demonstrating that adaptive nonlinearity linearization in deep layers enhances both training stability and final model quality.

dynActivation Authors

arXiv 2603.22154v1

cs.CV大语言模型

Mamba-VMR: Multimodal Query Augmentation via Generated Videos for Precise Temporal Grounding

Text-driven video moment retrieval (VMR) remains challenging due to limited capture of hidden temporal dynamics in untrimmed videos, leading to imprecise grounding in long sequences. Traditional methods rely on natural language queries (NLQs) or static image augmentations, overlooking motion sequences and suffering from high computational costs in Transformer-based architectures. Existing approaches fail to integrate subtitle contexts and generated temporal priors effectively, we therefore propose a novel two-stage framework for enhanced temporal grounding. In the first stage, LLM-guided subtitle matching identifies relevant textual cues from video subtitles, fused with the query to generate auxiliary short videos via text-to-video models, capturing implicit motion information as temporal priors. In the second stage, augmented queries are processed through a multi-modal controlled Mamba network, extending text-controlled selection with video-guided gating for efficient fusion of generated priors and long sequences while filtering noise. Our framework is agnostic to base retrieval models and widely applicable for multimodal VMR. Experimental evaluations on the TVR benchmark demonstrate significant improvements over state-of-the-art methods, including reduced computational overhead and higher recall in long-sequence grounding.

arXiv 2603.22121v1

cs.LG大语言模型

On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation

Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning capabilities of large language models. While existing analyses identify that RLVR-induced changes are sparse, they primarily focus on the \textbf{magnitude} of these updates, largely overlooking their \textbf{direction}. In this work, we argue that the direction of updates is a more critical lens for understanding RLVR's effects, which can be captured by the signed, token-level log probability difference $Δ\log p$ between the base and final RLVR models. Through statistical analysis and token-replacement interventions, we demonstrate that $Δ\log p$ more effectively identifies sparse, yet reasoning-critical updates than magnitude-based metrics (\eg divergence or entropy). Building on this insight, we propose two practical applications: (1) a \textit{test-time extrapolation} method that amplifies the policy along the learned $Δ\log p$ direction to improve reasoning accuracy without further training; (2) a \textit{training-time reweighting} method that focuses learning on low-probability (corresponding to higher $Δ\log p$) tokens, which improves reasoning performance across models and benchmarks. Our work establishes the direction of change as a key principle for analyzing and improving RLVR.

Kexin Huang +12

arXiv 2603.22117v1

cs.CV大语言模型

Multiperspectivity as a Resource for Narrative Similarity Prediction

Predicting narrative similarity can be understood as an inherently interpretive task: different, equally valid readings of the same text can produce divergent interpretations and thus different similarity judgments, posing a fundamental challenge for semantic evaluation benchmarks that encode a single ground truth. Rather than treating this multiperspectivity as a challenge to overcome, we propose to incorporate it in the decision making process of predictive systems. To explore this strategy, we created an ensemble of 31 LLM personas. These range from practitioners following interpretive frameworks to more intuitive, lay-style characters. Our experiments were conducted on the SemEval-2026 Task 4 dataset, where the system achieved an accuracy score of 0.705. Accuracy improves with ensemble size, consistent with Condorcet Jury Theorem-like dynamics under weakened independence. Practitioner personas perform worse individually but produce less correlated errors, yielding larger ensemble gains under majority voting. Our error analysis reveals a consistent negative association between gender-focused interpretive vocabulary and accuracy across all persona categories, suggesting either attention to dimensions not relevant for the benchmark or valid interpretations absent from the ground truth. This finding underscores the need for evaluation frameworks that account for interpretive plurality.

Max Upravitelev +6

arXiv 2603.22103v1

cs.AI大语言模型

GSEM: Graph-based Self-Evolving Memory for Experience Augmented Clinical Reasoning

Clinical decision-making agents can benefit from reusing prior decision experience. However, many memory-augmented methods store experiences as independent records without explicit relational structure, which may introduce noisy retrieval, unreliable reuse, and in some cases even hurt performance compared to direct LLM inference. We propose GSEM (Graph-based Self-Evolving Memory), a clinical memory framework that organizes clinical experiences into a dual-layer memory graph, capturing both the decision structure within each experience and the relational dependencies across experiences, and supporting applicability-aware retrieval and online feedback-driven calibration of node quality and edge weights. Across MedR-Bench and MedAgentsBench with two LLM backbones, GSEM achieves the highest average accuracy among all baselines, reaching 70.90\% and 69.24\% with DeepSeek-V3.2 and Qwen3.5-35B, respectively. Code is available at https://github.com/xhan1022/gsem.

arXiv 2603.22096v1

cs.AI大语言模型

A Context Engineering Framework for Improving Enterprise AI Agents based on Digital-Twin MDP

Despite rapid progress in AI agents for enterprise automation and decision-making, their real-world deployment and further performance gains remain constrained by limited data quality and quantity, complex real-world reasoning demands, difficulties with self-play, and the lack of reliable feedback signals. To address these challenges, we propose a lightweight, model-agnostic framework for improving LLM-based enterprise agents via offline reinforcement learning (RL). The proposed Context Engineering via DT-MDP (DT-MDP-CE) framework comprises three key components: (1) A Digital-Twin Markov Decision Process (DT-MDP), which abstracts the agent's reasoning behavior as a finite MDP; (2) A robust contrastive inverse RL, which, armed with the DT-MDP, to efficiently estimate a well-founded reward function and induces policies from mixed-quality offline trajectories; and (3) RL-guided context engineering, which uses the policy obtained from the integrated process of (1) and (2), to improve the agent's decision-making behavior. As a case study, we apply the framework to a representative task in the enterprise-oriented domain of IT automation. Extensive experimental results demonstrate consistent and significant improvements over baseline agents across a wide range of evaluation settings, suggesting that the framework can generalize to other agents sharing similar characteristics in enterprise environments.

arXiv 2603.22083v1

cs.CL大语言模型Transformer

Autoregressive vs. Masked Diffusion Language Models: A Controlled Comparison

This paper presents a controlled empirical comparison between autoregressive (AR) and masked diffusion (MDLM) language models trained on identical data (50M tokens from TinyStories), compute budget (20,000 steps, batch size 32, sequence length 512), and hardware (NVIDIA H100 80GB). The study reveals three key findings: both paradigms achieve comparable training throughput (~50K tokens/second), AR converges faster but overfits earlier while MDLM improves more gradually, and there exists a structural diversity-fluency trade-off with AR producing fluent but repetitive outputs and MDLM generating more diverse narratives.

Anonymous Authors

arXiv 2603.22075v1

cs.CL大语言模型端到端

On the Failure of Topic-Matched Contrast Baselines in Multi-Directional Refusal Abliteration

This paper investigates the effectiveness of topic-matched contrast baselines in multi-directional refusal abliteration for instruction-tuned language models. The research focuses on extracting refusal-mediating directions from the residual stream activation space of the Qwen 3.5 2B model using per-category matched prompt pairs, Self-Organizing Map extraction, and Singular Value Decomposition orthogonalization. The key finding reveals that while unmatched contrast successfully achieves complete refusal elimination across tested weight levels and layers, topic-matched contrast produces no functional refusal directions. This work highlights the importance of contrast baseline construction as a critical methodological consideration rather than an implementation detail in model abliteration research.

arXiv 2603.22061v1

cs.CL大语言模型端到端

Dual-Space Knowledge Distillation with Key-Query Matching for Large Language Models with Vocabulary Mismatch

This paper addresses the challenge of knowledge distillation between large language models with different tokenizers. The authors systematically analyze the attention mechanism of Dual-Space Knowledge Distillation with Cross-Model Attention (DSKD-CMA), revealing its strengths and limitations through token alignment probing and visualization. They propose DSKD-CMA-GA, a novel method leveraging Generative Adversarial learning to better align mismatched key-query distributions across models. Experimental results demonstrate consistent improvements in text generation quality as measured by ROUGE-L scores, offering a more transparent and effective approach to compressing large language models for efficient deployment.

arXiv 2603.22056v1

cs.CV大语言模型

Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models

While Vision-Language Models (VLMs) have achieved remarkable performance, their Euclidean embeddings remain limited in capturing hierarchical relationships such as part-to-whole or parent-child structures, and often face challenges in multi-object compositional scenarios. Hyperbolic VLMs mitigate this issue by better preserving hierarchical structures and modeling part-whole relations (i.e., whole scene and its part images) through entailment. However, existing approaches do not model that each part has a different level of semantic representativeness to the whole. We propose UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) for enhancing hyperbolic VLMs. UNCHA models part-to-whole semantic representativeness with hyperbolic uncertainty, by assigning lower uncertainty to more representative parts and higher uncertainty to less representative ones for the whole scene. This representativeness is then incorporated into the contrastive objective with uncertainty-guided weights. Finally, the uncertainty is further calibrated with an entailment loss regularized by entropy-based term. With the proposed losses, UNCHA learns hyperbolic embeddings with more accurate part-whole ordering, capturing the underlying compositional structure in an image and improving its understanding of complex multi-object scenes. UNCHA achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification benchmarks. Our code and models are available at: https://github.com/jeeit17/UNCHA.git.

arXiv 2603.22042v1

cs.LG大语言模型

AdditiveLLM2: A Multi-modal Large Language Model for Additive Manufacturing

This work presents AdditiveLLM2 a multi-modal, domain adapted large language model built upon the instruction tuned variant of the Gemma 3 model using a relatively small dataset of around 50 million tokens. The dataset (AdditiveLLM2-OA) consists of open-access additive manufacturing journal articles with data extracted for the domain adaptive pretraining and visual instruction tuning processes. Various stages of the developed model are evaluated with the Additive-Manufacturing-Benchmark which consists of additive manufacturing domain specific tasks compiled published resources. AdditiveLLM2 exhibits proficiency in both language and vision based tasks, achieving accuracies upwards of 90% in general additive manufacturing knowledge. This domain adaptive pretraining and instruction tuning strategy outline an accessible specialization method for large language models to a domain such as additive manufacturing.

arXiv 2603.22017v1

cs.LG大语言模型

ROM: Real-time Overthinking Mitigation via Streaming Detection and Intervention

Large Reasoning Models (LRMs) achieve strong accuracy on challenging tasks by generating long Chain-of-Thought traces, but suffer from overthinking. Even after reaching the correct answer, they continue generating redundant reasoning steps. This behavior increases latency and compute cost and can also lead to answer drift. Existing mitigation methods either require training-heavy backbone modification or rely on hand-crafted heuristics that do not truly capture overthinking patterns. We propose ROM, the first method that formulates overthinking mitigation as a streaming prediction-and-control problem. ROM attaches a lightweight detection head to the late-layer hidden states of a frozen large language model backbone. It monitors tokens in real time and triggers an early transition to the final answer once overthinking is detected. We also introduce token-level supervision based on solution correctness boundaries and a data augmentation strategy that reduces distilled-data bias. Across seven benchmarks, ROM achieves the highest accuracy (93.51%), the shortest responses (1,159 tokens), and the best response efficiency. Compared with the vanilla baseline, it reduces response length by 47.2% and improves efficiency by 121%. These results show that streaming detection is a promising approach to real-time overthinking mitigation.

arXiv 2603.22016v1

cs.CL大语言模型端到端

Retrieving Climate Change Disinformation by Narrative

This paper addresses the challenge of detecting climate disinformation narratives by re-framing traditional detection as a retrieval task. Rather than relying on fixed taxonomies that cannot accommodate emerging narratives, the approach ranks texts by alignment with a given narrative core message. The proposed SpecFi framework generates hypothetical documents to bridge abstract narrative descriptions with concrete textual instantiations, using community summaries from graph-based detection as few-shot examples for generation. The method achieves a MAP of 0.505 on the CARDS dataset without accessing narrative labels. Additionally, the paper introduces narrative variance, an embedding-based difficulty metric, and demonstrates its utility through partial correlation analysis for understanding narrative complexity in climate disinformation detection.

Anonymous Authors

arXiv 2603.22015v1

cs.IR大语言模型

On the Challenges and Opportunities of Learned Sparse Retrieval for Code

Retrieval over large codebases is a key component of modern LLM-based software engineering systems. Existing approaches predominantly rely on dense embedding models, while learned sparse retrieval (LSR) remains largely unexplored for code. However, applying sparse retrieval to code is challenging due to subword fragmentation, semantic gaps between natural-language queries and code, diversity of programming languages and sub-tasks, and the length of code documents, which can harm sparsity and latency. We introduce SPLADE-Code, the first large-scale family of learned sparse retrieval models specialized for code retrieval (600M-8B parameters). Despite a lightweight one-stage training pipeline, SPLADE-Code achieves state-of-the-art performance among retrievers under 1B parameters (75.4 on MTEB Code) and competitive results at larger scales (79.0 with 8B). We show that learned expansion tokens are critical to bridge lexical and semantic matching, and provide a latency analysis showing that LSR enables sub-millisecond retrieval on a 1M-passage collection with little effectiveness loss.

Simon Lupart +4

arXiv 2603.22008v1

cs.AI大语言模型端到端

TREX: Trajectory Explanations for Multi-Objective Reinforcement Learning

This paper presents TREX, a Trajectory-based Explainability framework designed for Multi-Objective Reinforcement Learning (MORL). The work addresses the limitation that traditional Explainable Reinforcement Learning (XRL) methods are typically tailored for single scalar rewards and fail to provide explanations when agents optimize multiple conflicting objectives simultaneously. The proposed approach enables agents to explicitly reason about trade-offs between different objectives and generates interpretable explanations for the decision-making process behind objective trade-offs. By focusing on trajectory-level explanations, TREX provides insights into how agents navigate decision spaces when balancing competing objectives in complex real-world scenarios.

Multiple Authors

arXiv 2603.21988v1

cs.CL大语言模型端到端

SecureBreak: A Dataset Towards Safe and Secure Models

Large language models are becoming pervasive core components in many real-world applications. As a consequence, security alignment represents a critical requirement for their safe deployment. Although previous related works focused primarily on model architectures and alignment methodologies, these approaches alone cannot ensure the complete elimination of harmful generations. This concern is reinforced by the growing body of scientific literature showing that attacks, such as jailbreaking and prompt injection, can bypass existing security alignment mechanisms. As a consequence, additional security strategies are needed both to provide qualitative feedback on the robustness of the obtained security alignment at the training stage, and to create an ultimate defense layer to block unsafe outputs possibly produced by deployed models. To provide a contribution in this scenario, this paper introduces SecureBreak, a safety-oriented dataset designed to support the development of AI-driven security solutions for language models.

SecureBreak Team

arXiv 2603.21975v1

上一页第 2 / 3 页下一页