返回论文列表
Paper Detail
On the Challenges and Opportunities of Learned Sparse Retrieval for Code
cs.IR大语言模型热门获取
Simon Lupart, Maxime Louis, Thibault Formal, Hervé Déjean, Stéphane Clinchant
2026年03月23日
arXiv: 2603.22008v1

作者人数

5

标签数量

2

内容状态

元数据

原文 + 中文

同页查看标题和摘要的双语信息

PDF 预览

直接在详情页阅读或下载论文全文

深度分析

继续下钻到 AI 生成的结构化解读

摘要 / Abstract

Retrieval over large codebases is a key component of modern LLM-based software engineering systems. Existing approaches predominantly rely on dense embedding models, while learned sparse retrieval (LSR) remains largely unexplored for code. However, applying sparse retrieval to code is challenging due to subword fragmentation, semantic gaps between natural-language queries and code, diversity of programming languages and sub-tasks, and the length of code documents, which can harm sparsity and latency. We introduce SPLADE-Code, the first large-scale family of learned sparse retrieval models specialized for code retrieval (600M-8B parameters). Despite a lightweight one-stage training pipeline, SPLADE-Code achieves state-of-the-art performance among retrievers under 1B parameters (75.4 on MTEB Code) and competitive results at larger scales (79.0 with 8B). We show that learned expansion tokens are critical to bridge lexical and semantic matching, and provide a latency analysis showing that LSR enables sub-millisecond retrieval on a 1M-passage collection with little effectiveness loss.

在 arXiv 查看

分类 / Categories

cs.IRcs.CL

深度分析

AI 深度理解论文内容,生成具有洞见性的总结