Abstract
This paper investigates whether topic-matched contrast baselines are effective for multi-directional refusal abliteration in instruction-tuned language models. It extracts refusal-mediating directions from the residual stream activation space of the Qwen 3.5 2B model using per-category matched prompt pairs, Self-Organizing Map extraction, and Singular Value Decomposition orthogonalization. The key finding is that unmatched contrast eliminates refusal completely across all tested weight levels and layers, whereas topic-matched contrast produces no functional refusal directions. The work argues that contrast baseline construction is a critical methodological consideration in model abliteration research, not an implementation detail.
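To make the pipeline in the abstract more concrete, the following is a minimal sketch of contrast-based direction extraction followed by SVD orthogonalization and weighted ablation. It is not the paper's implementation: it runs on synthetic activations rather than Qwen 3.5 2B residual streams, it replaces the Self-Organizing Map step with a plain per-category difference of means, and every name in it (`category_directions`, `orthonormal_basis`, `ablate`, the `weight` parameter, the category labels) is a hypothetical chosen for illustration.

```python
import numpy as np


def category_directions(harmful, harmless):
    """Per-category difference-of-means directions.

    harmful, harmless: dicts mapping a category name to an array of
    shape (n_prompts, d_model). Assumption: a simple mean difference
    stands in for the Self-Organizing Map extraction named in the abstract.
    """
    dirs = [harmful[c].mean(axis=0) - harmless[c].mean(axis=0) for c in harmful]
    return np.stack(dirs)  # shape (n_categories, d_model)


def orthonormal_basis(directions, tol=1e-8):
    """Orthogonalize candidate directions with SVD.

    The rows of Vt are orthonormal and span the row space of `directions`;
    rows whose singular value is negligible are dropped.
    """
    _, s, vt = np.linalg.svd(directions, full_matrices=False)
    keep = s > tol * s.max()
    return vt[keep]


def ablate(activations, basis, weight=1.0):
    """Remove `weight` times the component of each activation in the subspace."""
    coeffs = activations @ basis.T  # (n, k) coordinates in the subspace
    return activations - weight * (coeffs @ basis)


# Synthetic stand-in for residual stream activations (assumption, not real data).
rng = np.random.default_rng(0)
d_model = 16
categories = ["cat_a", "cat_b", "cat_c"]  # hypothetical category labels
harmless = {c: rng.normal(size=(32, d_model)) for c in categories}
harmful = {}
for c in categories:
    shift = rng.normal(size=d_model)  # planted per-category offset
    harmful[c] = rng.normal(size=(32, d_model)) + 3.0 * shift

directions = category_directions(harmful, harmless)
basis = orthonormal_basis(directions)
sample = rng.normal(size=(5, d_model))
ablated = ablate(sample, basis, weight=1.0)
```

In this sketch, the difference between unmatched and topic-matched contrast lies only in how the `harmless` prompts are chosen for each category, which is the design decision the abstract identifies as critical. With `weight=1.0` the ablated activations have no remaining component in the extracted subspace, and smaller weights remove it only partially.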