Foundation Models for Embodied Navigation: A Survey
Abstract
Embodied navigation, the process by which physically situated agents perceive and reason over egocentric observations to reach target locations, has recently been reshaped by the emergence of foundation models. Departing from traditional task-specific policies trained from scratch on limited datasets, this new paradigm leverages the reasoning and multimodal capabilities of Large Language Models (LLMs), Vision-Language Models (VLMs), and Video Generation Models (VGMs), achieving superior generalization and flexible decision-making in unseen environments.
This survey presents the first systematic review of the landscape of foundation models for embodied navigation, focusing on systems where these models play a central role in perception, long-horizon memory management, and action generation. We categorize and interpret existing embodied navigation research through the lens of design paradigms, data sources, and training strategies. Through this analysis, we offer a synthesized outlook on the evolution of navigation brains, highlight the bottleneck of dataset bias, and provide guidance for future research, with the goal of bringing the field closer to robust, general-purpose embodied intelligence.
What This Survey Covers
Problem Formulation and Taxonomy
We formalize the embodied navigation problem and distinguish it from other embodied intelligence tasks, such as locomotion and manipulation. We then present taxonomies of navigation tasks and robotic embodiments.
Key Design Dimensions
We analyze how agents encode observations and spatial structure, how they maintain and update memory under partial observability, and how they translate accumulated information into navigation decisions and executable motion. We then extend the discussion to the system level and examine what architectural designs are used to coordinate these input, memory, and output components, including modular systems, single-policy systems, dual-system designs, and world-model-based variants.
Data Collection and Training Strategies
We summarize the major data sources that support navigation models and discuss representative learning paradigms built on them, highlighting dataset biases and leakage issues that remain key obstacles to real-world generalization.
Efficient Deployment
We review efficient deployment for embodied navigation from two perspectives: (1) real-world deployment across robotic embodiments and (2) acceleration techniques spanning model design and software-system optimization.
Benchmarks and Evaluation Metrics
We organize existing benchmarks by task type and highlight the distinct evaluation objectives associated with each category. We then summarize the major families of evaluation metrics, including task success, trajectory quality, instruction or semantic alignment, generalization, safety, and system efficiency.
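As a concrete illustration of the task-success and trajectory-quality metric families mentioned above, the sketch below computes Success Rate and SPL (Success weighted by Path Length, Anderson et al.'s standard navigation metric). The episode dictionary fields (`success`, `shortest`, `taken`) are illustrative placeholders, not the schema of any particular benchmark.

```python
# Minimal sketch of two common embodied-navigation metrics.
# Episode fields are hypothetical: "success" (bool), "shortest" (geodesic
# start-to-goal distance), "taken" (length of the path actually executed).

def success_rate(episodes):
    """Fraction of episodes in which the agent reached the goal."""
    return sum(ep["success"] for ep in episodes) / len(episodes)

def spl(episodes):
    """Success weighted by Path Length: each successful episode contributes
    shortest / max(taken, shortest), so shorter successful paths score higher;
    failed episodes contribute 0."""
    total = 0.0
    for ep in episodes:
        if ep["success"]:
            total += ep["shortest"] / max(ep["taken"], ep["shortest"])
    return total / len(episodes)

episodes = [
    {"success": True,  "shortest": 10.0, "taken": 12.5},  # success, mildly inefficient
    {"success": True,  "shortest": 8.0,  "taken": 8.0},   # success, optimal path
    {"success": False, "shortest": 6.0,  "taken": 20.0},  # failure contributes 0
]
print(round(success_rate(episodes), 3))  # 0.667
print(round(spl(episodes), 3))           # (10/12.5 + 1 + 0) / 3 = 0.6
```

Note how SPL separates the two successful episodes that a bare success rate would treat identically, which is why trajectory-quality metrics are reported alongside terminal success.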
Figures from the Survey
Timeline of representative systems across the recent foundation-model era.
Task taxonomy covering semantic, geometric, and interactive navigation.
Embodiment taxonomy across wheeled robots, legged robots, and UAVs.
Key design dimensions from perception and memory to action generation.
Architectural patterns including modular, single-policy, dual-system, and world-model-based systems.
Future Directions
Establishing the scaling law for embodied navigation.
While scaling has proven transformative for language and beyond, the path to a similar scaling law for navigation remains obscured by a severe data bottleneck. Existing synthetic and simulation data suffer from a persistent sim-to-real gap and limited diversity, particularly in visual geometry and physical dynamics. Conversely, there is a distinct lack of large-scale, diverse, and high-quality real-world navigation data to provide the necessary supervision for multi-billion-parameter models. Bridging this gap through better data engines or massive-scale robot logs is essential.
Converging VLM and VGM backbones.
A fundamental question for future navigation foundation models is whether to adopt VLMs or VGMs as the primary backbone. Theoretically, embodied navigation requires both the strong semantic reasoning and instruction-following capabilities of VLMs, as well as the world-modeling and physical-prediction capabilities of VGMs. Moving forward, the field should explore architectures that fuse these strengths, enabling agents that can both understand complex linguistic goals and predict the physical consequences of their movements.
Developing next-generation benchmarks.
Many existing benchmarks are increasingly outdated, failing to capture the requirements of modern navigation foundation models such as open-vocabulary reasoning and social compliance. Future benchmarks must move beyond terminal success rates to include rigorous evaluation of instruction fidelity, real-time latency, and robustness under dynamic disturbances.
Hardware-aware algorithmic optimization.
Given the strict hardware constraints of physical robots, future work must focus on bridging the gap between heavy foundation models and edge hardware. This includes advancing asynchronous inference, algorithmic compression, and intelligent device-cloud orchestration to ensure that reasoning does not come at the cost of reactive safety.
Read the Full Paper
BibTeX
@misc{longxi2026fmnav,
title = {Foundation Models for Embodied Navigation: A Survey},
author = {Gao, Longxi and Xie, Weikai and Qian, Haoze and Yi, Rongjie and Wang, Shihe and Song, Jiaye and Cai, Dongqi and Yuan, Jinliang and Liu, Yunhao and Liu, Xuanzhe and Wang, Shangguang and Xu, Mengwei},
year = {2026},
howpublished = {\url{https://MEmbodied.github.io/embodied-navigation-survey}},
note = {Online; accessed 2026-03-25}
}