Research on learning instruction-guided visual navigation can be broadly categorized into high-level, category-specific search and low-level, language-guided navigation, depending on the granularity of the language instruction: the former emphasizes exploration, while the latter concentrates on following detailed textual commands. Despite these differing focuses, the underlying requirements of interpreting instructions, comprehending the surroundings, and inferring action decisions remain consistent. This paper consolidates diverse navigation tasks into a unified and generic framework -- we investigate the core difficulties of sharing general knowledge and exploiting task-specific capabilities in learning navigation, and we propose a novel State-Adaptive Mixture of Experts (SAME) model that enables an agent to infer decisions from instructions of varying granularity and from dynamic observations. Powered by SAME, we present a versatile agent that addresses seven navigation tasks simultaneously, outperforming or achieving performance highly comparable to task-specific agents.
Figure 1. We consolidate diverse navigation tasks into a unified language-guided navigation framework organized by language granularity. Previous approaches rely on task-specific designs tailored to particular types of language instructions, as shown in (a) and (b). In contrast, we propose a versatile system that can interpret and execute arbitrary language instructions, as shown in (c).
Figure 2. Illustration of MoE placement and expert routing methods. SAME routes on multimodal features from visual observations and language instructions, allowing the agent to dynamically adapt to changes in its visual surroundings.
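For a concrete picture of the routing idea in Figure 2, below is a minimal PyTorch sketch of a mixture-of-experts feed-forward layer whose router is conditioned on pooled visual and language state features rather than on individual tokens. The class name `StateAdaptiveMoE`, the expert shapes, and all hyperparameters are illustrative assumptions for exposition only, not the released implementation; please refer to the finetuning code for the actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateAdaptiveMoE(nn.Module):
    """Sketch: MoE FFN block routed by a multimodal state
    (pooled visual observation + pooled instruction feature).
    All sizes are illustrative, not the paper's configuration."""

    def __init__(self, dim=768, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Independent feed-forward experts with identical shapes.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])
        # Router scores experts from the concatenated visual + language state,
        # so expert selection adapts to the current observation, not the token.
        self.router = nn.Linear(2 * dim, num_experts)

    def forward(self, tokens, vis_state, lang_state):
        # tokens:     (B, T, D) hidden states entering the FFN block
        # vis_state:  (B, D)    pooled visual observation feature
        # lang_state: (B, D)    pooled instruction feature
        state = torch.cat([vis_state, lang_state], dim=-1)        # (B, 2D)
        gate_logits = self.router(state)                          # (B, E)
        topk_val, topk_idx = gate_logits.topk(self.top_k, dim=-1) # (B, K)
        weights = F.softmax(topk_val, dim=-1)                     # (B, K)

        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, k] == e                        # (B,) samples routed to expert e
                if mask.any():
                    w = weights[mask, k].unsqueeze(-1).unsqueeze(-1)  # (n, 1, 1)
                    out[mask] = out[mask] + w * expert(tokens[mask])
        return out
```

Note that because the gate is computed once per sample from the multimodal state, all tokens of that sample share the same expert mixture at a given step; the mixture changes as the agent's observation changes during navigation.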
- Release SAME finetuning code.
- Release multi-task co-training data.
- Release pretrained model weights.
- Release data preparation scripts.
We extend our gratitude to Matterport3D for their valuable contributions to the open-source platform and community.
We also acknowledge the significant benefits of using DUET, ScaleVLN, and NaviLLM in this work. Our thanks go to the creators of these outstanding projects.
If you find this work helpful, please consider citing:
@article{zhou2024same,
  title={SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts},
  author={Gengze Zhou and Yicong Hong and Zun Wang and Chongyang Zhao and Mohit Bansal and Qi Wu},
  journal={arXiv preprint arXiv:2412.05552},
  year={2024},
}