Train a Diffusion Trajectory Forecaster model on the Waymo Motion dataset for the MMLS HSE project.
To create the environment and install dependencies:
conda create -n diffusion_tracker python=3.10
conda activate diffusion_tracker
pip install uv
uv sync
You can run the whole project inside Docker with GPU access instead of installing the Python environment locally.
- Build the image from the repository root:
docker build -t diffusion-trajectory-forecaster .
- Make the helper script executable:
chmod +x scripts/docker_run.sh
- Start an interactive shell inside the container:
scripts/docker_run.sh bash
- Run project commands inside that shell:
uv run python train.py
How it works:
- the repository is mounted into the container at /app
- your code, checkpoints, outputs, and local changes stay on the host machine
- the container uses its own virtual environment at /opt/venv, so Docker does not recreate or modify your host .venv
- the helper script runs the container with your host UID/GID so generated files remain writable by your user and Git can stage them
- container-side cache and auth files are stored in the gitignored .docker-cache/ directory
Notes:
- rebuild the image after this change so the container environment is created under /opt/venv
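The behaviour listed under "How it works" maps onto a handful of standard `docker run` flags. The sketch below shows one way such a command could be assembled. It is not the contents of scripts/docker_run.sh: the function name `docker_run_argv` and the exact flag set are assumptions for illustration, and the .docker-cache/ handling is left out.

```python
import os


def docker_run_argv(repo_root, command, image="diffusion-trajectory-forecaster",
                    uid=None, gid=None):
    """Build a `docker run` argument list (hypothetical sketch, not the real script)."""
    # Default to the calling user's IDs so files created in /app stay writable.
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    return [
        "docker", "run", "--rm", "-it",
        "--gpus", "all",             # expose host GPUs to the container
        "--user", f"{uid}:{gid}",    # run as the host UID/GID
        "-v", f"{repo_root}:/app",   # mount the repository at /app
        "-w", "/app",                # start in the mounted repository
        image,
        *command,                    # e.g. ["bash"] for an interactive shell
    ]
```

Passing the result to `subprocess.run(...)` with `["bash"]` as the command would correspond to the interactive-shell step, under the assumptions stated above.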
- Apply for Waymo Open Dataset access.
- Install the gcloud CLI.
- Run gcloud auth login <your_email> with the same email you used to apply for dataset access.
- Run gcloud auth application-default login.
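After the last step, gcloud writes an Application Default Credentials file that Google client libraries pick up automatically. A quick way to confirm the login worked is to look for that file. The sketch below assumes Linux or macOS, where the default location is ~/.config/gcloud/application_default_credentials.json; the helper name `find_adc_file` is hypothetical.

```python
import os
from pathlib import Path


def find_adc_file(env=None, home=None):
    """Return the credentials file path if one exists, else None (sketch)."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    candidates = []
    # An explicit override takes precedence over the gcloud default location.
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit:
        candidates.append(Path(explicit))
    # Default location written by `gcloud auth application-default login`
    # on Linux and macOS (assumption: CLOUDSDK_CONFIG is not customised).
    candidates.append(
        home / ".config" / "gcloud" / "application_default_credentials.json"
    )

    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

If `find_adc_file()` returns None, rerunning gcloud auth application-default login is the likely fix.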
To build processed train/val/test datasets from raw Waymo data:
uv run python -m scripts.create_dataset
Processed datasets are tracked with DVC as directory artifacts. Git stores the .dvc metadata files, while the actual .wds files live locally or in the configured DVC remote.
Remote configuration:
- keep the remote URL in .dvc/config
- keep credentials such as access_key_id and secret_access_key in .dvc/config.local
- do not commit .dvc/config.local
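The rule that credentials stay out of the shared .dvc/config can be checked mechanically, for example before a commit. The sketch below parses the config text with the standard-library configparser and reports any credential options it finds. The helper name `committed_secrets` is hypothetical, and using configparser on DVC's INI-style config is an assumption that holds for simple files like the one described here.

```python
import configparser

# Option names that should only appear in .dvc/config.local.
SECRET_KEYS = {"access_key_id", "secret_access_key"}


def committed_secrets(config_text):
    """Return (section, key) pairs for credential options found in the text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    found = []
    for section in parser.sections():
        for key in parser[section]:
            if key in SECRET_KEYS:
                found.append((section, key))
    return found
```

Calling it on the contents of .dvc/config should return an empty list; a non-empty result means a credential was written without the --local flag.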
Amazon S3 credentials setup:
uv run dvc remote list
uv run dvc remote modify --local myremote access_key_id <AWS_ACCESS_KEY_ID>
uv run dvc remote modify --local myremote secret_access_key <AWS_SECRET_ACCESS_KEY>
Notes:
- the shared repository config already defines the default DVC remote URL and region
To track a new dataset, run:
uv run scripts/add_local_dataset_to_dvc.sh path_to_dataset_folder
The script adds the dataset to DVC, pushes it to remote storage, and stages the .dvc file.
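The three actions the script performs correspond to three ordinary commands. The sketch below lists them for a given dataset folder. It is an illustration of the described behaviour, not the script itself, and the function name `dataset_tracking_commands` is hypothetical.

```python
from pathlib import Path


def dataset_tracking_commands(dataset_dir):
    """Return the commands that add, push, and stage a dataset (sketch)."""
    dataset = Path(dataset_dir)
    # `dvc add data/foo` writes its metadata to data/foo.dvc.
    dvc_file = dataset.with_name(dataset.name + ".dvc")
    return [
        ["dvc", "add", dataset.as_posix()],    # track the directory with DVC
        ["dvc", "push", dvc_file.as_posix()],  # upload it to the remote
        ["git", "add", dvc_file.as_posix()],   # stage the metadata file
    ]
```

Note that dvc add also writes a .gitignore entry next to the dataset, which you may want to stage as well.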
Pull one dataset explicitly:
uv run dvc pull data/processed_v1.dvc
uv run dvc pull data/processed_v2.dvc
uv run dvc pull data/baseline1.dvc
Push updated artifacts:
uv run dvc push
Training uses a unified small_no_scenes dataset config.
- if the local dataset path exists, it is loaded directly
- there is no separate local/create/stream dataset config anymore
Notes:
- training first checks the local data.*.path directories
- dataset generation can also happen through training when the local dataset is missing and creation_cfg is set on the dataset config
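The lookup order described in these notes can be summarised as a small decision function. This is a sketch of the described behaviour rather than the project's code: the name `plan_dataset_action` is hypothetical, and raising an error when neither condition holds is an assumption.

```python
from pathlib import Path


def plan_dataset_action(path, creation_cfg=None):
    """Decide how training obtains a dataset split (hypothetical sketch)."""
    # 1. An existing local directory is loaded directly.
    if Path(path).is_dir():
        return "load"
    # 2. Otherwise the dataset is generated if creation_cfg is set.
    if creation_cfg is not None:
        return "create"
    # 3. Assumption: with neither available, training cannot proceed.
    raise FileNotFoundError(
        f"dataset not found at {path} and no creation_cfg is set"
    )
```

The same check would apply to each of the data.*.path entries in turn.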