LagMemo

Abstract

Navigating to a designated goal using visual information is a fundamental capability for intelligent robots. Most classical visual navigation methods are restricted to single-goal, single-modality, and closed-set goal settings. To address the practical demands of multi-modal, open-vocabulary goal queries and multi-goal visual navigation, we propose LagMemo, a navigation system that leverages a language 3D Gaussian Splatting (3DGS) memory. During exploration, LagMemo constructs a unified 3D language memory. When task goals arrive, the system queries the memory, predicts candidate goal locations, and integrates a local perception-based verification mechanism to dynamically match and validate goals during navigation. For fair and rigorous evaluation, we curate GOAT-Core, a high-quality core split distilled from GOAT-Bench and tailored to multi-modal, open-vocabulary, multi-goal visual navigation. Experimental results show that LagMemo’s memory module enables effective multi-modal open-vocabulary goal localization, and that LagMemo outperforms state-of-the-art methods in multi-goal visual navigation.

Video

Overview

Framework

Language 3DGS Memory Reconstruction and Memory-Guided Visual Navigation Pipeline

Experiments

Goal Localization

Instance Query Illustration: Given different text queries, the corresponding responses are retrieved from the 3DGS memory. Renderings from the corresponding viewpoints show that the retrieved results match the expected targets.
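The kind of query illustrated above can be sketched in a few lines. This is only a toy illustration, not LagMemo's implementation: it assumes that each memory entry pairs a 3D position with a language feature vector, that the query has already been embedded into the same feature space by some encoder (not shown), and that candidates are ranked by cosine similarity. The names `query_memory`, `toy_memory`, and `toy_query` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec = Sequence[float]
MemoryEntry = Tuple[Tuple[float, float, float], Vec]  # (3D position, language feature)


def cosine(a: Vec, b: Vec) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def query_memory(memory: List[MemoryEntry], query_feature: Vec, top_k: int = 3):
    """Return the top_k (score, position) pairs, best match first.

    Assumption: ranking by cosine similarity stands in for whatever
    matching the actual memory module performs.
    """
    scored = [(cosine(feature, query_feature), position) for position, feature in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy memory with 3-dimensional features; real language features would be much larger.
toy_memory: List[MemoryEntry] = [
    ((1.0, 0.0, 2.0), (0.9, 0.1, 0.0)),
    ((4.0, 0.0, 1.0), (0.0, 1.0, 0.0)),
    ((2.5, 0.0, 3.0), (0.7, 0.7, 0.0)),
]
toy_query: Vec = (1.0, 0.0, 0.0)

candidates = query_memory(toy_memory, toy_query, top_k=2)
for score, position in candidates:
    print(f"candidate at {position} with similarity {score:.3f}")
```

Returning several ranked candidates rather than a single best match fits the description of predicting candidate goal locations that are verified later during navigation.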

Visual Navigation

Step-by-step Visualization of Memory-Guided Navigation to an Image Goal: Columns show key steps (28, 67, 141, 165); rows show the front view, the top-down map, and the 3D localization results (red). In this case, the agent visits waypoints 1, 2, and 3 in turn (yellow stars; current waypoint in red). After checking the first two, it arrives at the third, where the goal verification module identifies the goal. The agent then proceeds to the final goal (green star), and the subtask terminates successfully at step 165.
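The waypoint-then-verify behaviour in this example can be summarised as a simple loop. The sketch below is a hedged illustration rather than the system's actual controller: `navigate_with_verification`, `go_to`, and `verify` are hypothetical names, and the verification step is reduced to a callback that returns a goal position when the goal is recognised at a waypoint.

```python
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]


def navigate_with_verification(
    waypoints: List[Point],
    go_to: Callable[[Point], None],
    verify: Callable[[Point], Optional[Point]],
) -> Optional[Tuple[int, Point]]:
    """Visit candidate waypoints in order until the goal is verified.

    Assumption: `verify` stands in for a local perception-based check and
    returns the goal position if the goal is seen from this waypoint,
    otherwise None.
    """
    for index, waypoint in enumerate(waypoints):
        go_to(waypoint)
        goal = verify(waypoint)
        if goal is not None:
            go_to(goal)  # final approach to the verified goal
            return index, goal
    return None  # no candidate waypoint passed verification


# Toy run mirroring the figure: the goal is only confirmed at the third waypoint.
visited: List[Point] = []
toy_waypoints: List[Point] = [(1.0, 1.0), (3.0, 2.0), (5.0, 5.0)]
toy_goal: Point = (5.0, 5.5)

result = navigate_with_verification(
    toy_waypoints,
    go_to=visited.append,
    verify=lambda wp: toy_goal if wp == toy_waypoints[2] else None,
)
print(result)
print(visited)
```

Separating candidate proposal from verification means a mismatched candidate costs only a detour to the next waypoint rather than a failed subtask.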

LagMemo: Language 3D Gaussian Splatting Memory for Multi-modal Open-vocabulary Multi-goal Visual Navigation