返回论文列表
Paper Detail
The Autonomy Tax: Defense Training Breaks LLM Agents自主性代价:防御训练对LLM智能体的损害
cs.CL端到端Transformer热门获取具身智能多模态
Anonymous Authors
2026年03月20日
arXiv: 2603.19423v1

作者人数

1

标签数量

5

内容状态

含 PDF

原文 + 中文

同页查看标题和摘要的双语信息

PDF 预览

直接在详情页阅读或下载论文全文

深度分析

继续下钻到 AI 生成的结构化解读

摘要 / Abstract

Large language model agents increasingly rely on external tools such as file operations, API calls, and database transactions to autonomously complete complex multi-step tasks. Practitioners deploy defense-trained models to protect against prompt injection attacks that manipulate agent behavior through malicious observations. This paper reveals a fundamental capability-alignment paradox where defense training designed to improve safety systematically destroys agent competence while failing to prevent sophisticated attacks. Evaluating defended models against undefended baselines across 97 agent tasks and 1,000 adversarial prompts, we uncover three systematic biases unique to multi-step agents including agent incompetence bias manifesting as immediate tool execution breakdown and cascade amplification bias causing early failures to propagate through retry loops. These findings demonstrate that current defense training approaches create significant trade-offs between safety and functional capability in autonomous LLM agents.

大语言模型智能体越来越多地依赖文件操作、API调用和数据库事务等外部工具来自主完成复杂多步骤任务。本文揭示了一个基本的能力-对齐悖论:旨在提高安全性的防御训练在系统性破坏智能体能力的同时,无法有效阻止复杂攻击。通过在97个智能体任务和1000个对抗性提示词上评估防御模型与未防御基线,我们发现了多步骤智能体特有的三种系统性偏见,包括智能体能力不足偏见、级联放大偏见等。这些发现表明,当前的防御训练方法在自主性LLM智能体的安全性与功能性之间造成了显著的权衡困境。

PDF 预览
1
在 arXiv 查看下载 PDF

分类 / Categories

cs.CLcs.AIcs.CR

深度分析

AI 深度理解论文内容,生成具有洞见性的总结