Abstract

Large language models (LLMs) are increasingly deployed in human-AI teams as support agents for complex tasks such as information retrieval, programming, and decision-making assistance. This paper studies whether LLMs can act as defensive supervisors that detect malicious behavior within mixed human-AI teams. Using a dataset of multi-party conversations and decisions spanning a 25-round horizon, we formulate the problem of detecting malicious behavior from interaction traces. We find that LLMs identify malicious behavior effectively, which demonstrates their potential as defensive actors in collaborative environments.


Categories

In-depth analysis