This repository contains a set of scripts for downloading a dataset from Airbnb.
First, you need git lfs to clone the repository. Install it from the command line:
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git lfs install
You can now clone the repository:
git clone https://github.com/airbert-vln/bnb-dataset.git
If you cloned the repository without LFS installed, you will have received an error message. You can fix it by running:
make lfs
You need a recent version of Python (3.8 or higher) and to install the dependencies through poetry:
# install python for ubuntu 20.04
sudo apt install python3 python3-pip
pip install poetry
# install dependencies
poetry install
# activate the environment (do this in each new shell)
poetry shell
Note that typing is extensively used in these scripts. This was a real time saver for detecting errors before runtime. You might want to set up your IDE to play well with mypy. For Neovim users, I recommend the coc.nvim extension coc-pyright.
Managing a large number of images is tricky and usually takes a lot of time. The scripts therefore split the task among several workers: a cache folder keeps the ordered list of items assigned to each worker, while each worker produces its own output file. Look for the num_workers or num_procs parameters in the argtyped Arguments.
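For reference, here is a minimal sketch of what such an argtyped Arguments class looks like. The field names and defaults below are purely illustrative, not the actual arguments of any script:

from argtyped import Arguments

class DownloadArguments(Arguments):
    # Illustrative fields -- check each script for its real arguments
    csv_file: str = "data/bnb-dataset.tsv"
    output: str = "data/images"
    num_workers: int = 8

args = DownloadArguments()  # parses the command line
print(args.num_workers)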
This step builds a TSV file with 4 columns: listing ID, photo ID, image URL, image caption. Too high a request rate will get you rejected by Airbnb, so it is advised to split the job among different IP addresses.
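For illustration, the rows are tab-separated and can be read with Python's csv module. Whether the file carries a header row depends on the scripts, so treat this as a sketch:

import csv

# Read the 4-column TSV: listing ID, photo ID, image URL, image caption
with open("data/bnb-dataset.tsv", newline="") as f:
    for listing_id, photo_id, url, caption in csv.reader(f, delimiter="\t"):
        print(listing_id, photo_id, url, caption[:40])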
Please note that you can use the pre-computed TSV file used in our paper for training and for testing. The file was generated during Christmas 2019 (yeah, before Covid. Sounds so far away now!). Some images might not be available anymore.
Also, note that this file contains only a portion of all Airbnb listings. It might be interesting to extend it.
Airbnb listings are searched within a specific region, so we first need to initialize the list of regions. A quick hack for that consists in scraping Wikipedia lists of places, as done in the script cities.py.
For this script, you need to download and install Selenium. The instructions here are valid only for a Linux distribution; otherwise, follow the guide from the Selenium documentation.
pip install selenium
wget https://github.com/mozilla/geckodriver/releases/download/v0.30.0/geckodriver-v0.30.0-linux32.tar.gz
mkdir -p $HOME/.local/bin
export PATH=$PATH:$HOME/.local/bin
tar -xvf geckodriver-v0.30.0-linux32.tar.gz -C $HOME/.local/bin
# Testing the driver path is recognized:
geckodriver --version
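You can also check that Selenium itself is able to drive Firefox through geckodriver with a short headless session (the Wikipedia page below is just an arbitrary example):

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")  # no display needed on a server
driver = webdriver.Firefox(options=options)
driver.get("https://en.wikipedia.org/wiki/List_of_largest_cities")
print(driver.title)
driver.quit()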
Here is how I scraped a list of cities. You might want to update this script in order to increase the number of cities.
python cities.py --output data/cities.txt
You can see other examples in the locations/ folder, which was used in an attempt to enlarge the BnB dataset.
# Download a list of listings from the list of cities
python search_listings.py --locations data/cities.txt --output data/listings
# Download JSON files for each listing
python download_listings.py --listings data/listings.txt --output data/merlin --with_photo
# Note that you can also download reviews and info (see python download_listings.py --help)
# Extract photo URLs from listing export files
python extract_photo_metadata.py --merlin data/merlin --output data/bnb-dataset-raw.tsv
# Apply basic rules to remove some captions
python filter_captions.py --input data/bnb-dataset-raw.tsv --output data/bnb-dataset.tsv
Now we want to download the images and filter out outdoor images. The download rate can be higher here before the server kicks us out, but it is still preferable to use a pool of IP addresses.
python download_images.py --csv_file data/bnb-dataset.tsv --output data/images --correspondance /tmp/cache-download-images/
python detect_errors.py --images data/images --merlin data/merlin
Outdoor images tend to be of lower quality and their captions are often not relevant. We first detect outdoor images with a CNN pretrained on the Places365 dataset; later on, we will keep only indoor images.
Note that the output of this step is also used for image merging.
# Detect room types
python detect_room.py --output data/places365/detect.tsv --images data/images
# Keep only indoor images
python extract_indoor.py --output data/bnb-dataset-indoor.tsv --detection data/places365/detect.tsv
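To give an idea of what the indoor/outdoor decision is based on: the Places365 project provides an indoor/outdoor flag for each of its 365 scene categories (IO_places365.txt, where 1 means indoor and 2 means outdoor). The sketch below assumes a detection TSV with one image path and its predicted category per row; the real format produced by detect_room.py may differ, so check the scripts before reusing it:

# Build the set of indoor categories from the Places365 mapping
indoor = set()
with open("IO_places365.txt") as f:
    for line in f:
        name, flag = line.split()
        if int(flag) == 1:
            indoor.add(name)

# Keep only images whose predicted category is indoor (assumed TSV layout)
with open("data/places365/detect.tsv") as f:
    for line in f:
        image, category = line.rstrip("\n").split("\t")[:2]
        if category in indoor:
            print(image)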
Extract visual features and store them in a single file. Several steps are required to achieve that. Unfortunately, we don't own the rights to the Airbnb images, and thus we are not permitted to share our own LMDB file.
5% of the dataset is allocated to the test set:
round() {
  printf "%.${2}f" "${1}"
}
num_rows=$(wc -l < data/bnb-dataset-indoor.tsv)
test=$(round "$(echo "$num_rows * 0.05" | bc)" 0)
tail -n "$test" data/bnb-dataset-indoor.tsv > data/bnb-test-indoor-filtered.tsv
train=$((num_rows - test))
head -n "$train" data/bnb-dataset-indoor.tsv > data/bnb-train-indoor-filtered.tsv
This step is one of the most annoying ones, since the installation of bottom-up top-down attention is outdated. I put a Dockerfile and a Singularity definition file in the container folder to help you with that.
Note that this step is also extremely slow and you might want to use multiple GPUs.
python precompute_airbnb_img_features_with_butd.py --images data/images
If this step is too difficult, open an issue and I'll try to use the PyTorch version instead.
# Extract keys
python extract_keys.py --output data/keys.txt --datasets data/bnb-dataset-indoor.tsv
# Create an LMDB
python convert_to_lmdb.py --output img_features --keys data/keys.txt
Note that you can split the LMDB into multiple files by using a number of workers. This could be relevant when your LMDB file is super huge!
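Once the LMDB is built, you can sanity-check it by reading back a key with the lmdb package. This is only a sketch: how the values are serialized is decided by convert_to_lmdb.py, so the decoding step is left as a comment:

import lmdb

# Open the LMDB created by convert_to_lmdb.py in read-only mode
env = lmdb.open("img_features", readonly=True, lock=False)
with env.begin() as txn, open("data/keys.txt") as f:
    key = f.readline().strip()
    value = txn.get(key.encode())
    # value is the raw serialized record for this key; decode it the same
    # way convert_to_lmdb.py encoded it (e.g. pickle -- check the script)
    print(key, "missing" if value is None else f"{len(value)} bytes")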
Almost there! We built image-caption pairs and now we want to convert them into path-instruction pairs. Actually, we are just going to produce JSON files that you can feed into the training repository.
python preprocess_dataset.py --csv data/bnb-train-indoor-filtered.tsv --name bnb_train
python preprocess_dataset.py --csv data/bnb-test-indoor-filtered.tsv --name bnb_test
python merge_photos.py --source bnb_train.json --output merge+bnb_train.json --detection-dir data/places365
python merge_photos.py --source bnb_test.json --output merge+bnb_test.json --detection-dir data/places365
python preprocess_dataset.py --csv data/bnb-dataset-indoor.tsv --captionless True --min-caption 2 --min-length 4 --name 2capt+bnb_train
python preprocess_dataset.py --csv data/bnb-dataset-indoor.tsv --captionless True --min-caption 2 --min-length 4 --name 2capt+bnb_test
# Extract noun phrases from BnB captions
python extract_noun_phrases.py --source data/bnb-train-indoor-filtered.tsv --output data/bnb-train.np.tsv
python extract_noun_phrases.py --source data/bnb-test-indoor-filtered.tsv --output data/bnb-test.np.tsv
# Extract noun phrases from R2R train set
python perturbate_dataset.py --infile R2R_train.json --outfile np_train.json --mode object --training True
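For reference, noun phrase extraction of this kind can be done with spaCy's noun_chunks, along these lines (only a sketch; the actual extract_noun_phrases.py script may rely on a different pipeline):

import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Bright double bedroom with a sea view and a private bathroom")
for chunk in doc.noun_chunks:
    print(chunk.text)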
You need to create a test set for each dataset. Here is an example for captionless insertion.
python build_testset.py --output data/bnb/2capt+testset.json --out-listing False --captions 2capt+bnb_test.json