摘要 / Abstract

Large language model agents increasingly rely on external tools such as file operations, API calls, and database transactions to autonomously complete complex multi-step tasks. Practitioners deploy defense-trained models to protect against prompt injection attacks that manipulate agent behavior through malicious observations. This paper reveals a fundamental capability-alignment paradox where defense training designed to improve safety systematically destroys agent competence while failing to prevent sophisticated attacks. Evaluating defended models against undefended baselines across 97 agent tasks and 1,000 adversarial prompts, we uncover three systematic biases unique to multi-step agents including agent incompetence bias manifesting as immediate tool execution breakdown and cascade amplification bias causing early failures to propagate through retry loops. These findings demonstrate that current defense training approaches create significant trade-offs between safety and functional capability in autonomous LLM agents.

摘要 / Abstract

分类 / Categories

深度分析