Foundation Models for Embodied Navigation: A Survey

Longxi Gao1*, Weikai Xie1*, Haoze Qian1, Rongjie Yi1, Shihe Wang1, Jiaye Song1, Dongqi Cai2, Jinliang Yuan3, Yunhao Liu3, Xuanzhe Liu4, Shangguang Wang1, Mengwei Xu1†
1Beijing University of Posts and Telecommunications 2Nanjing University 3Tsinghua University 4Peking University
*Equal contribution. †Corresponding author.
Teaser figure for foundation models in embodied navigation

Overview: Embodied navigation systems enable robots to autonomously navigate to desired physical locations by understanding task instructions and comprehending environmental information. These systems are typically supported by architectural designs that integrate perception, memory, reasoning, and control, and are deployed across diverse embodiments, scenes, and task settings, with performance evaluated along multiple dimensions.

Abstract

The remarkable advancements of embodied navigation, the process of physically situated agents perceiving and reasoning through egocentric observations to reach target locations, have recently been reshaped by the emergence of foundation models. Departing from traditional, task-specific policies trained from scratch on limited datasets, this new paradigm leverages the reasoning and multimodal capabilities of Large Language Models (LLMs), Vision-Language Models (VLMs), and Video Generation Models (VGMs), achieving superior generalization and flexible decision-making in unseen environments.

For the first time, this survey systematically reviews the landscape of foundation models for embodied navigation, focusing on systems where these models play a central role in perception, long-horizon memory management, and action generation. The survey aims to categorize and interpret existing embodied navigation research through the lens of design paradigms, data sources, and training strategies. Through this analysis, we offer a synthesized outlook on the evolution of navigation brains, highlight the bottlenecks of dataset bias, and contribute guidance for future research, hoping to bring the field closer to robust, general-purpose embodied intelligence.

What This Survey Covers

Problem Formulation and Taxonomy

We formalize the embodied navigation problem and distinguish it from other embodied intelligence tasks, such as locomotion and manipulation. We then present taxonomies of navigation tasks and robotic embodiments.
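One common way to formalize this setting (the notation below is illustrative, not necessarily the survey's own) is as a goal-conditioned partially observable Markov decision process:

```latex
\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{O}, T, \Omega, g),
\qquad a_t \sim \pi(a_t \mid o_{1:t}, g),
```

where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space (e.g., move forward, turn, stop), $\mathcal{O}$ the space of egocentric observations, $T$ the transition dynamics, $\Omega$ the observation function, and $g$ the goal, which may be a coordinate, an object category, or a natural-language instruction. The agent succeeds if it issues a stop action sufficiently close to the goal.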

Key Design Dimensions

We analyze how agents encode observations and spatial structure, how they maintain and update memory under partial observability, and how they translate accumulated information into navigation decisions and executable motion. We then extend the discussion to the system level and examine what architectural designs are used to coordinate these input, memory, and output components, including modular systems, single-policy systems, dual-system designs, and world-model-based variants.
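The perceive-remember-decide loop described above can be sketched in a few lines. All class names and method signatures here are hypothetical illustrations of the pattern, not APIs from any specific system surveyed.

```python
class EpisodicMemory:
    """Accumulates observations under partial observability."""

    def __init__(self):
        self.history = []

    def update(self, observation):
        self.history.append(observation)

    def summary(self):
        # A real system might build a semantic map or compress the history
        # with a language model; here we simply keep recent observations.
        return self.history[-5:]


def navigation_step(policy, memory, observation, instruction):
    """One control step: ingest the observation, then query the policy."""
    memory.update(observation)
    return policy(instruction, memory.summary())


# Toy policy stand-in: move forward until the goal appears in view.
toy_policy = lambda instr, ctx: "stop" if "goal" in ctx[-1] else "forward"
```

Modular, single-policy, and dual-system designs differ mainly in how `policy` is realized, e.g., as a pipeline of specialized modules, one end-to-end network, or a slow reasoner paired with a fast controller.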

Data Collection and Training Strategies

We summarize the major data sources that support navigation models and discuss representative learning paradigms built on them, highlighting dataset biases and leakage issues that remain key obstacles to real-world generalization.

Efficient Deployment

We review efficient deployment for embodied navigation from two perspectives: (1) real-world deployment across robotic embodiments and (2) acceleration techniques spanning model design and software-system optimization.

Benchmarks and Evaluation Metrics

We organize existing benchmarks by task type and highlight the distinct evaluation objectives associated with each category. We then summarize the major families of evaluation metrics, including task success, trajectory quality, instruction or semantic alignment, generalization, safety, and system efficiency.
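As a concrete example of a trajectory-quality metric, Success weighted by Path Length (SPL) scores each episode by how close the agent's path length is to the shortest path, zeroing out failures. A minimal sketch, with the episode tuple format chosen here for illustration:

```python
def spl(episodes):
    """Mean SPL over episodes given as (success, shortest_len, taken_len).

    Each successful episode contributes shortest / max(taken, shortest),
    so an agent that succeeds via the shortest path scores 1.0 and longer
    detours are penalized; failed episodes contribute 0.
    """
    total, n = 0.0, 0
    for success, shortest, taken in episodes:
        n += 1
        if success:
            total += shortest / max(taken, shortest)
    return total / n if n else 0.0
```

For instance, one optimal success, one success via a path twice as long, and one failure average to an SPL of 0.5.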

Future Directions

Establishing the scaling law for embodied navigation.

While scaling has proven transformative for language models and beyond, the path to a similar scaling law for navigation remains obscured by a severe data bottleneck. Existing synthetic and simulation data suffer from a persistent sim-to-real gap and limited diversity, particularly in visual geometry and physical dynamics. Conversely, there is a distinct lack of large-scale, diverse, and high-quality real-world navigation data to provide the necessary supervision for multi-billion-parameter models. Bridging this gap through better data engines or massive-scale robot logs is essential.

Converging VLM and VGM backbones.

A fundamental question for future navigation foundation models is whether to adopt VLMs or VGMs as the primary backbone. In principle, embodied navigation requires both the strong semantic reasoning and instruction-following capabilities of VLMs and the world-modeling and physical-prediction capabilities of VGMs. Moving forward, the field should explore architectures that fuse these strengths, enabling agents that can both understand complex linguistic goals and predict the physical consequences of their movements.

Developing next-generation benchmarks.

Many existing benchmarks are increasingly outdated, failing to capture the requirements of modern navigation foundation models such as open-vocabulary reasoning and social compliance. Future benchmarks must move beyond terminal success rates to include rigorous evaluation of instruction fidelity, real-time latency, and robustness under dynamic disturbances.

Hardware-aware algorithmic optimization.

Given the strict hardware constraints of physical robots, future work must focus on bridging the gap between heavy foundation models and edge hardware. This includes advancing asynchronous inference, algorithmic compression, and intelligent device-cloud orchestration to ensure that reasoning does not come at the cost of reactive safety.
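The asynchronous-inference idea above can be illustrated with a slow planner running in the background while a fast reactive loop keeps acting on the latest available plan. This is a hypothetical sketch of the pattern using standard-library threading, not any particular robot stack:

```python
import queue
import threading
import time


def slow_planner(plan_queue, stop_event):
    """Background thread standing in for heavy foundation-model inference."""
    step = 0
    while not stop_event.is_set():
        time.sleep(0.05)                 # simulated inference latency
        plan_queue.put(f"plan-{step}")   # publish the freshest plan
        step += 1


def reactive_loop(plan_queue, n_ticks):
    """Fast control loop that never blocks waiting for the planner."""
    current_plan = "plan-initial"
    executed = []
    for _ in range(n_ticks):
        try:
            current_plan = plan_queue.get_nowait()  # adopt new plan if ready
        except queue.Empty:
            pass                          # otherwise keep the stale plan
        executed.append(current_plan)
        time.sleep(0.01)                  # fast control tick
    return executed
```

The key property is that control-loop frequency is decoupled from model latency: the robot stays reactive (e.g., for obstacle avoidance) even when high-level reasoning is slow.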

Read the Full Paper

BibTeX

@misc{longxi2026fmnav,
  title        = {Foundation Models for Embodied Navigation: A Survey},
  author       = {Gao, Longxi and Xie, Weikai and Qian, Haoze and Yi, Rongjie and Wang, Shihe and Song, Jiaye and Cai, Dongqi and Yuan, Jinliang and Liu, Yunhao and Liu, Xuanzhe and Wang, Shangguang and Xu, Mengwei},
  year         = {2026},
  howpublished = {\url{https://MEmbodied.github.io/embodied-navigation-survey}},
  note         = {Online; accessed 2026-03-25}
}