Memory Over Maps: 3D Object Localization Without Reconstruction
This paper asks whether complete 3D scene reconstruction is necessary for object localization in embodied tasks. The authors propose a map-free pipeline that stores only posed RGB-D keyframes as a lightweight visual memory, with no global 3D representation. At query time, the method retrieves candidate views and re-ranks them with a vision-language model for semantic reasoning. It then builds a sparse, on-demand 3D estimate of the target by backprojecting depth from the selected views, so localization does not require expensive reconstruction. Compared with building a global map, the approach significantly reduces mapping time and storage overhead and scales more easily, while remaining effective for navigation and manipulation tasks.
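The depth backprojection step can be illustrated with a minimal sketch under the standard pinhole camera model. This is not the authors' implementation: the function names, the observation layout, and the per-axis median used to fuse several keyframes are all assumptions made for illustration, and the target pixel (u, v) is assumed to come from whatever view selection and grounding step precedes it.

```python
from statistics import median


def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth into the camera frame (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(point, pose):
    """Apply a 4x4 camera-to-world pose (row-major nested lists) to a 3D point."""
    x, y, z = point
    return tuple(
        pose[r][0] * x + pose[r][1] * y + pose[r][2] * z + pose[r][3]
        for r in range(3)
    )


def localize(observations):
    """Fuse per-keyframe estimates of the target into one world-frame point.

    Each observation is a dict with keys 'u', 'v', 'depth',
    'intrinsics' (fx, fy, cx, cy) and 'pose' (4x4 camera-to-world).
    The per-axis median is an assumed fusion rule, chosen here only
    because it tolerates a single outlying view.
    """
    points = []
    for obs in observations:
        if obs["depth"] <= 0:  # skip missing or invalid depth readings
            continue
        p_cam = backproject(obs["u"], obs["v"], obs["depth"], *obs["intrinsics"])
        points.append(camera_to_world(p_cam, obs["pose"]))
    if not points:
        return None
    return tuple(median(axis) for axis in zip(*points))


# Hypothetical keyframes: identical intrinsics, two different camera positions.
IDENTITY = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
SHIFTED = [[1, 0, 0, 1.0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
K = (500.0, 500.0, 320.0, 240.0)

estimate = localize([
    {"u": 320, "v": 240, "depth": 2.0, "intrinsics": K, "pose": IDENTITY},
    {"u": 320, "v": 240, "depth": 2.0, "intrinsics": K, "pose": SHIFTED},
    {"u": 320, "v": 240, "depth": 0.0, "intrinsics": K, "pose": IDENTITY},
])
```

Only the pixels belonging to the queried target are lifted to 3D, which is why the per-query cost stays small compared with fusing every frame into a global map.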