Paper Detail

On the Failure of Topic-Matched Contrast Baselines in Multi-Directional Refusal Abliteration

cs.CL, Large Language Models, End-to-End, Transformer, Trending

Anonymous

March 23, 2026

arXiv: 2603.22061v1

Number of authors

1

Number of tags

4

Content status

PDF included

Abstract

This paper investigates the effectiveness of topic-matched contrast baselines in multi-directional refusal abliteration for instruction-tuned language models. The research focuses on extracting refusal-mediating directions from the residual stream activation space of the Qwen 3.5 2B model using per-category matched prompt pairs, Self-Organizing Map extraction, and Singular Value Decomposition orthogonalization. The key finding reveals that while unmatched contrast successfully achieves complete refusal elimination across tested weight levels and layers, topic-matched contrast produces no functional refusal directions. This work highlights the importance of contrast baseline construction as a critical methodological consideration rather than an implementation detail in model abliteration research.
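The general shape of directional ablation described in the abstract can be illustrated with a minimal sketch. It is not the paper's pipeline: the activations are synthetic random vectors rather than Qwen residual-stream states, a plain difference of class means stands in for the Self-Organizing Map extraction step, and the names `orthonormal_directions`, `ablate`, `true_dirs`, and `weight` are hypothetical. It only shows how several per-category candidate directions might be orthogonalized with SVD and then projected out of an activation.

```python
import numpy as np

def orthonormal_directions(candidates, k):
    """Orthonormal basis for the top-k subspace of the candidate directions (via SVD)."""
    # Rows of vt are orthonormal right singular vectors of the candidate matrix.
    _, _, vt = np.linalg.svd(candidates, full_matrices=False)
    return vt[:k]

def ablate(x, directions, weight=1.0):
    """Subtract weight * (projection of x onto the span of the orthonormal directions)."""
    return x - weight * (x @ directions.T) @ directions

rng = np.random.default_rng(0)
d_model, n_categories, n_prompts = 64, 4, 32

# Synthetic stand-ins for residual-stream activations (assumption: not real model data).
true_dirs = rng.normal(size=(n_categories, d_model))
harmless = rng.normal(size=(n_categories, n_prompts, d_model))
harmful = rng.normal(size=(n_categories, n_prompts, d_model)) + 3.0 * true_dirs[:, None, :]

# One candidate direction per category: difference of mean activations
# (assumption: swapped in for the SOM-based extraction named in the abstract).
candidates = harmful.mean(axis=1) - harmless.mean(axis=1)  # shape (n_categories, d_model)
directions = orthonormal_directions(candidates, k=n_categories)

# Fully ablate (weight=1.0) and measure what remains along the extracted directions.
x_ablated = ablate(harmful.reshape(-1, d_model), directions, weight=1.0)
residual = np.abs(x_ablated @ directions.T).max()
```

With `weight=1.0` the ablated activations have essentially no component left along the extracted directions, and smaller weights remove proportionally less. In this sketch, the choice between matched and unmatched contrast would amount to choosing which prompts populate `harmless`.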
