Paper Detail
Autoregressive vs. Masked Diffusion Language Models: A Controlled Comparison
Tags: cs.CL · Large Language Models · Transformer · Multimodal
Anonymous Authors
March 23, 2026
arXiv: 2603.22075v1
Authors: 1
Tags: 4
Content: PDF available
Abstract
This paper presents a controlled empirical comparison between autoregressive (AR) and masked diffusion (MDLM) language models trained on identical data (50M tokens from TinyStories), compute budget (20,000 steps, batch size 32, sequence length 512), and hardware (NVIDIA H100 80GB). The study reveals three key findings: both paradigms achieve comparable training throughput (~50K tokens/second); AR converges faster but overfits earlier, while MDLM improves more gradually; and there is a structural diversity–fluency trade-off, with AR producing fluent but repetitive outputs and MDLM generating more diverse narratives.
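As a quick sanity check on the stated setup, the budget figures from the abstract imply the total token count and approximate wall-clock time below. This is a back-of-envelope sketch; the ~50K tokens/second throughput is the figure reported in the abstract, and the epoch count is derived from the 50M-token corpus size.

```python
# Back-of-envelope check of the training budget stated in the abstract.
steps = 20_000
batch_size = 32
seq_len = 512

tokens_per_step = batch_size * seq_len       # tokens consumed per optimizer step
total_tokens = steps * tokens_per_step       # tokens processed over all of training

throughput = 50_000                          # ~50K tokens/second (reported for both paradigms)
wall_clock_hours = total_tokens / throughput / 3600

corpus_size = 50_000_000                     # 50M tokens (TinyStories subset)
epochs = total_tokens / corpus_size          # effective passes over the data

print(f"total tokens: {total_tokens:,}")     # 327,680,000
print(f"wall clock:   {wall_clock_hours:.2f} h")
print(f"epochs:       {epochs:.2f}")
```

At ~50K tokens/second, the full 20,000-step run amounts to roughly 1.8 hours on the H100, covering the 50M-token corpus about 6.5 times.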
Categories
cs.CL, cs.LG