Abstract

This paper investigates whether 2D foundation image models inherently possess 3D world model capabilities by evaluating their performance on 3D world synthesis tasks. The authors propose a multi-agent architecture consisting of a VLM-based director, an image synthesizer, and a two-step verifier that evaluates outputs in both 2D image space and 3D reconstruction space. Through systematic benchmarking of state-of-the-art image generation models and Vision-Language Models, they demonstrate that their agentic approach achieves coherent and robust 3D reconstruction, enabling exploration through novel-view rendering. The research provides insights into leveraging implicit 3D knowledge from 2D foundation models for world-level scene understanding and generation.
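
The director, synthesizer, and two-step verifier described above can be read as a generate-and-check loop. The following is a minimal sketch under that assumption, not the authors' implementation: the names `synthesize_view`, `Verdict`, `verify_2d`, and `verify_3d`, the retry budget, and the order of the two checks are all hypothetical, and each agent is represented as a plain callable so that any model could be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Result of one verification step (hypothetical structure)."""
    ok: bool
    feedback: str = ""


def synthesize_view(
    goal: str,
    director: Callable[[str, str], str],      # (goal, feedback) -> prompt
    synthesizer: Callable[[str], object],     # prompt -> image
    verify_2d: Callable[[object], Verdict],   # check in 2D image space
    verify_3d: Callable[[object], Verdict],   # check in 3D reconstruction space
    max_attempts: int = 3,                    # assumed retry budget
) -> Optional[object]:
    """Return the first image that passes both checks, or None."""
    feedback = ""
    for _ in range(max_attempts):
        # The director turns the goal plus any earlier feedback into a prompt.
        prompt = director(goal, feedback)
        image = synthesizer(prompt)

        # Step 1: 2D check. On failure, feed the reason back to the director.
        v2d = verify_2d(image)
        if not v2d.ok:
            feedback = v2d.feedback
            continue

        # Step 2: 3D check, assumed here to run only after the 2D check passes.
        v3d = verify_3d(image)
        if v3d.ok:
            return image
        feedback = v3d.feedback
    return None
```

In this sketch a rejected image is never returned: the verifier's feedback is passed back to the director, which is one plausible way the two verification steps could steer the next synthesis attempt.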

Categories

In-Depth Analysis