Paper Detail

Dual-Space Knowledge Distillation with Key-Query Matching for Large Language Models with Vocabulary Mismatch

cs.CL, Large Language Models, End-to-End, Transformer

Anonymous

March 23, 2026

arXiv: 2603.22056v1

Number of authors

1

Number of tags

4

Content status

PDF included


Abstract

This paper addresses the challenge of knowledge distillation between large language models with different tokenizers. The authors systematically analyze the attention mechanism of Dual-Space Knowledge Distillation with Cross-Model Attention (DSKD-CMA), revealing its strengths and limitations through token alignment probing and visualization. They propose DSKD-CMA-GA, a novel method leveraging Generative Adversarial learning to better align mismatched key-query distributions across models. Experimental results demonstrate consistent improvements in text generation quality as measured by ROUGE-L scores, offering a more transparent and effective approach to compressing large language models for efficient deployment.
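For readers unfamiliar with the metric named in the abstract, ROUGE-L scores a generated text against a reference by the length of their longest common subsequence (LCS). The following is a minimal sketch, assuming whitespace tokenization, lowercasing, and a plain F1 combination of precision and recall. The function names are illustrative, and this is not the paper's evaluation code, which may use a different tokenizer or weighting.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # prev[j] holds the LCS length of the tokens of `a` seen so far and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between two strings (assumption: whitespace tokens)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)
```

A call such as `rouge_l_f1(generated_text, reference_text)` returns a value between 0 and 1, with higher values indicating that more of the reference is reproduced in order.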
