Official code release for Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents.
Paper (arXiv) · Project page · Annotations (Google Drive)
```
SkillNav/
├── skillnav/
│   └── backbones/
│       ├── scalevln/                 # SkillNav built on ScaleVLN (ViT-B/16)
│       │   ├── maps_nav_src/         # working dir for train/test
│       │   │   ├── moe/              # skill-based agents + VLM router
│       │   │   ├── models/           # transformer / VLN-BERT backbone
│       │   │   ├── r2r/              # navigation env, agent loop, parser
│       │   │   ├── prompts/          # router / reordering / data prompts
│       │   │   ├── evaluation/       # offline eval (NavNuances etc.)
│       │   │   ├── scripts/          # train / test bash scripts
│       │   │   └── utils/
│       │   └── datasets/             # features, annotations, ckpts
│       └── srdf/                     # SkillNav built on VLN-SRDF (InternViT-6B)
│           ├── map_nav_src/          # same layout as scalevln/maps_nav_src
│           └── datasets/
├── assets/                           # paper figures (PDF + PNG)
│   ├── figures/                      # rendered figures used by the page
│   └── source/                       # editable PDF sources
├── docs/                             # extra documentation
├── static/                           # project-page CSS / JS
├── index.html                        # project page
├── pyproject.toml
├── requirements.txt
└── README.md
```
The inner directory names `maps_nav_src/` (ScaleVLN) and `map_nav_src/` (SRDF) are kept verbatim from the upstream baselines so that their internal bare imports (`from utils.x`, `from moe.y`, …) keep working without rewriting any source file. The two backbone variants are kept side by side because they require different feature extractors and pretrained checkpoints (ScaleVLN-Aug vs. SRDF-Aug).
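Because of those bare imports, training and testing must be launched from inside the working directory, or with it on `PYTHONPATH`. A minimal sketch, assuming you start from the repo root:

```bash
# option 1: run scripts from inside the working directory
cd skillnav/backbones/scalevln/maps_nav_src

# option 2 (from the repo root): expose the working dir on PYTHONPATH instead
export PYTHONPATH="$(pwd)/skillnav/backbones/scalevln/maps_nav_src:$PYTHONPATH"
```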
We use the latest version of the Matterport3D Simulator (not v0.1). Python 3.9 is recommended.
```bash
# system deps
sudo apt-get update
sudo apt-get install -y libjsoncpp-dev libepoxy-dev libglm-dev libopencv-dev \
libegl1 libegl1-mesa-dev libgl1-mesa-dev libtiff-dev \
libosmesa6 libosmesa6-dev libglew-dev
# conda packages
conda create -n skillnav python=3.9 -y && conda activate skillnav
conda install -c conda-forge cmake gdal libtiff libstdcxx-ng -y
# build the simulator (EGL backend)
cd Matterport3DSimulator
mkdir -p build && cd build
cmake -DEGL_RENDERING=ON -DPYTHON_EXECUTABLE="$(which python)" ..
make -j
# expose to PYTHONPATH
export PYTHONPATH=$(realpath ..):$PYTHONPATH
```
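To verify the build, check that the simulator's Python bindings import cleanly (the bindings are exposed as the `MatterSim` module):

```bash
# should print the path of the compiled MatterSim extension
python -c "import MatterSim; print(MatterSim.__file__)"
```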
```bash
git clone https://github.com/HLR/SkillNav.git
cd SkillNav
pip install -r requirements.txt
pip install -e .  # editable install of the skillnav package
```

The router uses a VLM served via vLLM. If you plan to run the router locally, make sure `vllm`, `transformers>=4.45`, and a compatible CUDA stack are installed (see `requirements.txt`).
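A quick way to confirm the router dependencies are in place (a generic check, nothing repo-specific):

```bash
python -c "import vllm, transformers; print(vllm.__version__, transformers.__version__)"
```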
Download from the Google Drive folder and place under each backbone's annotations folder:
```
skillnav/backbones/scalevln/datasets/R2R/annotations/
skillnav/backbones/srdf/datasets/R2R/annotations/
```
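For example, from the repo root (create the folders first, then move the downloaded files in):

```bash
mkdir -p skillnav/backbones/scalevln/datasets/R2R/annotations
mkdir -p skillnav/backbones/srdf/datasets/R2R/annotations
# copy the files downloaded from the Google Drive folder into each directory
```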
| Backbone | Features | Init checkpoint |
|---|---|---|
| ScaleVLN | ViT-B/16 (same as ScaleVLN) | ScaleVLN-pretrained ViT-B/16 |
| SRDF | InternViT-6B (same as VLN-SRDF) | SRDF-pretrained checkpoint |
Drop them under each backbone's `datasets/R2R/features/` and `datasets/R2R/trained_models/` directories. The bash scripts under each backbone's `scripts/` directory reference these paths directly.
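A sketch of the expected layout, from the repo root (directory names follow the paths above; the exact file names depend on which features and checkpoints you download):

```bash
mkdir -p skillnav/backbones/scalevln/datasets/R2R/{features,trained_models}
mkdir -p skillnav/backbones/srdf/datasets/R2R/{features,trained_models}
```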
Each skill specialist is trained on its own skill-specific augmentation split.
```bash
# ScaleVLN backbone
cd skillnav/backbones/scalevln/maps_nav_src
bash scripts/train_r2r_b16_mix_vertical.sh # Vertical Movement (VM)
bash scripts/train_r2r_b16_mix_direction.sh # Directional Adjustment (DA)
bash scripts/train_r2r_b16_mix_landmark.sh # Landmark Detection (LD)
bash scripts/train_r2r_b16_mix_region.sh # Area & Region Identification (AR)
bash scripts/train_r2r_b16_mix_stop.sh # Stop & Pause (SP)
bash scripts/train_r2r_b16_mix_temporal.sh # Temporal Reordering data
```
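The six ScaleVLN scripts share a common naming pattern, so they can also be run back to back. A convenience sketch, not a script shipped with the repo:

```bash
# train all six specialists sequentially (run from maps_nav_src)
for skill in vertical direction landmark region stop temporal; do
    bash scripts/train_r2r_b16_mix_${skill}.sh
done
```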
```bash
# SRDF (InternViT-6B) backbone
cd skillnav/backbones/srdf/map_nav_src
bash scripts/train_r2r_internvit6b_mix_vertical.sh
# …same five skills…
```

End-to-end evaluation uses the VLM-based action router (top-1 routing). First, serve the router VLM:
```bash
cd skillnav/backbones/scalevln/maps_nav_src/moe
python vLLM_API.py \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --port 8000
```

Supported routers: Qwen2.5-VL-7B-Instruct, GLM-4.1V-9B-Thinking, and GPT-4o (via API).
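Once the server reports it is ready, a quick liveness check (this assumes `vLLM_API.py` starts a standard vLLM OpenAI-compatible server; adjust if the script serves a different route):

```bash
curl http://localhost:8000/v1/models
```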
With the router running, launch the evaluation scripts:

```bash
# R2R Val-Unseen / Test-Unseen
cd skillnav/backbones/scalevln/maps_nav_src
bash scripts/test_r2r_b16_moe-top1.sh
# GSA-R2R
bash scripts/test_gsa-r2r_b16_moe-top1.sh
# NavNuances per-skill eval
bash scripts/test_navnuance_b16_mix.sh
```

For the SRDF backbone, use the analogous scripts under `skillnav/backbones/srdf/map_nav_src/scripts/`.
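For example (illustrative only; check the actual file names, which mirror the ScaleVLN scripts):

```bash
cd skillnav/backbones/srdf/map_nav_src
ls scripts/   # pick the test_* script matching your benchmark/split
```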
SkillNav builds on two open-source VLN baselines: ScaleVLN and VLN-SRDF.
The novel SkillNav code (the skill specialists, the temporal reordering module, the VLM action router, and the skill-specific synthetic data prompts) lives under each backbone's `moe/` and `prompts/` directories.
```bibtex
@misc{ma2025breakingbuildingupmixture,
  title         = {Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents},
  author        = {Tianyi Ma and Yue Zhang and Zehao Wang and Parisa Kordjamshidi},
  year          = {2025},
  eprint        = {2508.07642},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2508.07642}
}
```