Paper Detail

Enhancing Document-Level Machine Translation via Filtered Synthetic Corpora and Two-Stage LLM Adaptation

cs.CL, Large Language Models, End-to-End, Transformer

Ireh Kim, Tesia Sker, Chanwoo Kim

March 24, 2026

arXiv: 2603.22186v1

Author count

3

Tag count

4

Content status

Includes PDF

Abstract

This paper addresses the challenge of improving document-level machine translation with large language models (LLMs). The authors propose a two-stage fine-tuning strategy and augment the training data by using LLMs to convert summarization data into document-level parallel data. To ensure data quality, they filter the synthetic corpus using multiple metrics, including sacreBLEU, COMET, and LaBSE-based cosine similarity. The approach tackles two key challenges: the scarcity of large-scale document-level parallel data and the tendency of LLMs to produce hallucinations and omissions. By leveraging the strength of LLMs in modeling contextual information, the method aims to improve coherence across sentences in translated documents.
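The multi-metric filtering step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the sacreBLEU and COMET scores and the LaBSE embeddings have already been computed elsewhere, and the `ScoredPair` schema, the threshold values, and the rule of keeping a pair only when all three checks pass are hypothetical choices made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class ScoredPair:
    """A synthetic document pair with precomputed scores (hypothetical schema)."""
    source: str
    target: str
    bleu: float            # e.g. a sacreBLEU score on a 0-100 scale
    comet: float           # e.g. a COMET score
    src_emb: list[float]   # source embedding, e.g. produced by LaBSE
    tgt_emb: list[float]   # target embedding, e.g. produced by LaBSE


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_pair(pair, min_bleu=20.0, min_comet=0.75, min_cosine=0.8):
    """Keep a pair only if every metric clears its threshold.

    The thresholds are illustrative placeholders, not values from the paper.
    """
    return (
        pair.bleu >= min_bleu
        and pair.comet >= min_comet
        and cosine_similarity(pair.src_emb, pair.tgt_emb) >= min_cosine
    )


def filter_corpus(pairs, **thresholds):
    """Return the pairs that pass all three quality checks, in input order."""
    return [p for p in pairs if keep_pair(p, **thresholds)]
```

Requiring all three metrics to pass is the strictest way to combine them; a real pipeline might instead weight the scores or tune each threshold on held-out data.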
