This repository contains the implementation of our paper: Pretrained Language Models as Visual Planners for Human Assistance
- Create the virtual environment and activate
conda create --file env.yaml -p .venv_vlamp python==3.9.7
Note: If the installer complains about a missing Rust compiler while installing tokenizers, install it using this link.
Note: If you get a missing pkg-config error on M1, install it using the following commands and see this issue for more info.
brew install cmake
brew install pkg-config
conda activate ./.venv_vlamp
- Install the editable version
pip install -e .
Please follow the instructions in the official crosstask repository to download crosstask annotations into .data/crosstask
Once downloaded, you should see the following directory structure in the dataset directory:
|- crosstask
   |- annotations
   |- task_primary.txt
   |- task_related.txt
   |- videos.csv
   |- videos_val.csv
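As a quick sanity check after downloading, the layout listed above can be verified with a few lines of Python. This is only a sketch: the helper name `missing_entries` is hypothetical, and `data/crosstask` is an assumed location that may differ from your setup.

```python
from pathlib import Path

# Entries copied from the directory listing above; adjust if your copy differs.
EXPECTED_ENTRIES = [
    "annotations",
    "task_primary.txt",
    "task_related.txt",
    "videos.csv",
    "videos_val.csv",
]


def missing_entries(crosstask_dir):
    """Return the expected entries that are absent from crosstask_dir."""
    root = Path(crosstask_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    # "data/crosstask" is an assumed path; pass your own dataset directory.
    missing = missing_entries("data/crosstask")
    print("missing:", missing if missing else "none")
```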
task_primary.txt has a list of 18 primary tasks, where each task is listed using a task card in the following format, with an empty line between tasks:
Task ID
Task name
URL of corresponding WikiHow page
Number of steps
Ordered list of comma-separated steps of the task
Following are the first few lines from task_primary.txt
Make Jello Shots
pour water,pour juice,pour jello powder,pour alcohol,stir mixture,pour mixture into cup
Build Simple Floating Shelves
cut shelve,assemble shelve,sand shelve,paint shelve,attach shelve
Note that each step will get a unique step id in the context of the task based on its position in the last line of the task card.
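The task card format described above can be parsed with a small helper. The following is a minimal sketch, not the loader used in this repository: `TaskCard` and `parse_task_cards` are hypothetical names, the step ids are assumed to be 1-based positions in the step list, and the task id and URL in the demo are placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskCard:
    task_id: str
    name: str
    url: str
    steps: List[str]

    def step_id(self, step):
        # Assumption: step ids are 1-based positions in the step list.
        return self.steps.index(step) + 1


def parse_task_cards(text):
    """Parse task cards separated by empty lines into TaskCard objects."""
    cards = []
    block = []
    # The trailing "" flushes the last card even without a final empty line.
    for line in text.splitlines() + [""]:
        line = line.strip()
        if line:
            block.append(line)
            continue
        if not block:
            continue  # tolerate repeated empty lines
        if len(block) != 5:
            raise ValueError(f"expected 5 lines per task card, got {len(block)}")
        task_id, name, url, n_steps, step_line = block
        steps = [s.strip() for s in step_line.split(",")]
        if len(steps) != int(n_steps):
            raise ValueError(f"task {task_id}: step count mismatch")
        cards.append(TaskCard(task_id, name, url, steps))
        block = []
    return cards


if __name__ == "__main__":
    # The id "0" and the URL are placeholders, not values from the dataset.
    sample = (
        "0\n"
        "Make Jello Shots\n"
        "https://example.com/placeholder\n"
        "6\n"
        "pour water,pour juice,pour jello powder,pour alcohol,"
        "stir mixture,pour mixture into cup\n"
    )
    card = parse_task_cards(sample)[0]
    print(card.name, card.step_id("stir mixture"))
```

Validating the declared number of steps against the parsed list catches truncated or misaligned cards early.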
We use the primary tasks for our work.
The videos.csv file contains links to the videos, one per line, in the format <Task ID>,<YouTube video ID>,<URL>.
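For illustration, the video list can be grouped by task with the standard `csv` module. This is a sketch under the assumption that every non-empty row has exactly the three fields described above; `read_video_list` is a hypothetical helper name.

```python
import csv
from collections import defaultdict


def read_video_list(lines):
    """Group (video_id, url) pairs by task id from videos.csv-style rows."""
    videos_by_task = defaultdict(list)
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        task_id, video_id, url = row
        videos_by_task[task_id].append((video_id, url))
    return dict(videos_by_task)


def read_video_file(path):
    """Read a videos.csv-style file from disk."""
    with open(path, newline="") as f:
        return read_video_list(f)
```

Calling `read_video_file("data/crosstask/videos.csv")` (an assumed path) would return a dictionary mapping each task id to its list of videos.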
The annotations folder contains one file per video, named <task id>_<video id>.csv. Each file contains one step per line in the format <number of step>,<start in seconds>,<end in seconds>, where <number of step> is the id of the step in task_primary.txt. For example, the following are the contents of annotations/113766__uAZw6L_R5g.csv
into a separate folder and use python --dataset_root data/crosstask --n_workers 10 --download_dir data/crosstask/videos to download the videos into a directory named videos in the crosstask data directory.
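The per-video annotation files described earlier can be read as sketched below. The helper names are hypothetical, and the parsing assumes the three-field row format and the <task id>_<video id>.csv naming stated above. Because a YouTube video id can itself contain underscores (as in 113766__uAZw6L_R5g.csv), the file name is split only on the first underscore.

```python
import csv
from pathlib import Path


def split_annotation_name(path):
    """Split '<task id>_<video id>.csv' into (task_id, video_id)."""
    # Video ids may contain underscores, so split only on the first one.
    task_id, video_id = Path(path).stem.split("_", 1)
    return task_id, video_id


def read_annotation(lines):
    """Parse rows of '<number of step>,<start in seconds>,<end in seconds>'."""
    segments = []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        step, start, end = row
        segments.append((int(step), float(start), float(end)))
    return segments


def read_annotation_file(path):
    """Return (task_id, video_id, segments) for one annotation file."""
    task_id, video_id = split_annotation_name(path)
    with open(path, newline="") as f:
        return task_id, video_id, read_annotation(f)
```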
in a separate folder and follow the instructions therein to create S3D features.
We use the official implementation of VideoCLIP to perform video action segmentation. Note that our model omits the last FC layer (1024->512) of the S3D model, while VideoCLIP retains it. The S3D feature size should therefore be 1024 for input to our model and 512 for VideoCLIP.
in a separate directory. Go to the folder examples/MMPT and follow the instructions there to install VideoCLIP and to preprocess the crosstask dataset.
Once set up, execute the following to create 512-dimensional S3D features for the VideoCLIP model.
python scripts/video_feature_extractor/ --vdir <path_to_data>/crosstask/videos --fdir <path_to_data>/crosstask/s3dg_features_videoclip --type=s3d --num_decoding_thread=20 --batch_size 32 --half_precision 1
Modify projects/retri/videoclip/crosstask_videoclip_seg_new_split.yaml to point to the <path_to_data> as shown below:
train_path: <path_to_data>/crosstask/videos.csv
train_csv_path: <path_to_data>/crosstask/videos.csv
val_path: <path_to_data>/crosstask/videos_val.csv
val_csv_path: <path_to_data>/crosstask/videos_val.csv
primary_path: <path_to_data>/crosstask/tasks_primary.txt
related_path: <path_to_data>/crosstask/tasks_related.txt
vfeat_dir: <path_to_data>/crosstask/s3dg_features_videoclip
annotation_path: <path_to_data>/crosstask/annotations
To finetune on crosstask, execute the following:
python projects/retri/videoclip/crosstask_videoclip_seg_new_split.yaml --jobtype local_small
To perform prediction using the finetuned model, execute the following:
python projects/retri/videoclip/test_crosstask_videoclip_seg_new_split.yaml --jobtype local_predict
Finally, move the preds.json file to <path_to_data>/crosstask/videoclip_preds.json to use it with VLaMP.
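The final move can also be scripted. The sketch below is only illustrative: `install_predictions` is a hypothetical helper, and the location where VideoCLIP writes preds.json is not specified here, so it is passed in explicitly.

```python
import shutil
from pathlib import Path


def install_predictions(preds_file, data_root):
    """Move preds_file to <data_root>/crosstask/videoclip_preds.json."""
    target = Path(data_root) / "crosstask" / "videoclip_preds.json"
    # Create the crosstask directory if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(preds_file), str(target))
    return target
```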
The annotations can be downloaded from
Download the splits used in our work using the following:
Follow the same procedure as for crosstask to obtain S3D features and VideoCLIP segmentation results for COIN.
CUDA_DEVICE=0 seed=123 allennlp train best_model_configs_no_wandb/p3lm-GT.jsonnet -s <path to dir where model should be saved> --include-package vlamp
Download pretrained models:
| Dataset | Model id | drive id |
| --- | --- | --- |
| Crosstask | yyaqgtoq | 1AxJYiayjUTfWC_LEITKcBDYXcn39Pe54 |
| Crosstask | 9t05mx8h | 1ubyOF_DDzqmPMqkPxrljudZCo7NjZn5w |
| Crosstask | c0eovx5i | 1njBlBn-hZmq7TP6nk5IW2BtIg4i6U7Jk |
| Crosstask | e0mu18l3 | 17vi89l0hYPyq0C0kPq-xtAc1aU7qv7aY |
| Crosstask | 10yxgxr1 | 1oCNW5Fw_mxieHZPAQuAcommitrUGhhsN |
| COIN | p1z5i1xe | 16oku7zwPyKhC6vfo5hmE61EseCQsiUFZ |
| COIN | 7cq1owwq | 1gBBipLfTqSKwDTNfBurC_9vBqebcBLGj |
| COIN | h1lgoxdm | 1WzLe33pxwtwNUV-sgFxS-Yx1KPaZKu0l |
| COIN | s71nb0kl | 1XW4h_qCTRDRZXgP4_01UHFK1o4HFs35P |
| COIN | pi1vd8h6 | 1YZc0CbwEJGz9PMRj3a7YoWS50qis2Xn1 |
Download each model by opening
and then unzip each model into a separate directory.
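Unpacking several downloaded archives into separate directories can be automated with the standard `zipfile` module. This is a sketch under assumptions: it expects the downloaded .zip files to sit together in one folder, and it names each output directory after the archive file (for example, a hypothetical yyaqgtoq.zip would be extracted into yyaqgtoq/).

```python
import zipfile
from pathlib import Path


def unzip_models(archive_dir, output_dir):
    """Extract every .zip in archive_dir into its own subdirectory of output_dir."""
    extracted = []
    for archive in sorted(Path(archive_dir).glob("*.zip")):
        # One directory per archive, named after the archive file.
        target = Path(output_dir) / archive.stem
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(target)
    return extracted
```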
To run eval on a model run
with the appropriate model path.
The majority of VLaMP is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: HuggingFace Transformers is licensed under the Apache 2.0 license; S3D and COIN are licensed under the MIT license; CrossTask is licensed under the BSD-3 license.