Multi-modal features toolkit in Python, developed at the University of Cambridge Computer Laboratory. The aim of this toolkit is to make it easier for researchers to use multi-modal features. Both image and sound (i.e., visual and auditory representations) are supported.
The following models are currently available:
- CNN: Convolutional neural network representations for images
- BoVW: Bag-of-visual-words for images, using DSIFT local descriptors
- BoAW: Bag-of-audio-words for sound files, using MFCC local descriptors
The following dependencies need to be installed: numpy, scipy, scikit-learn and yaml. If you want to use the CNN model, you will also need to install Caffe. For BoAW you will need to install librosa as well.
sudo apt-get install build-essential python-dev python-setuptools \
python-numpy python-scipy python-sklearn python-yaml
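As a quick sanity check, the dependencies should all be importable from Python:
# Verify that the core Python dependencies can be imported.
import numpy
import scipy
import sklearn
import yaml
print("numpy", numpy.__version__)
print("scipy", scipy.__version__)
print("scikit-learn", sklearn.__version__)
print("PyYAML", yaml.__version__)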
The toolkit comes with two tools that do not require any knowledge of Python and that can be run from the command line.
The miner is used for mining images or sound files. Before you can use it, you need to acquire API keys from Google, Bing, FreeSound or Flickr and set them in miner.yaml (see miner-example.yaml for an example). The query_file argument should point to a file that contains a list of queries, one query per line (a sketch for generating such a file follows the examples below). Usage:
miner.py [-h] [-n NUM_FILES]
{bing,google,freesound,flickr} query_file data_dir
Examples:
# Get 10 images per query term from Bing and store in a data directory
python miner.py -n 10 bing list_of_queries.txt ./img_data_dir
# Get 100 sound files per query term from FreeSound and store in a data directory
python miner.py -n 100 freesound list_of_queries.txt ./sound_data_dir
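The query file itself is plain text with one query per line; if you want to generate one programmatically, a minimal sketch (the queries and file name are just examples):
# Write a query file with one query per line.
queries = ['elephant', 'trumpet', 'happiness']
with open('list_of_queries.txt', 'w') as f:
    f.write('\n'.join(queries) + '\n')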
The extractor is used for extracting representations from a data directory. The data directory needs to contain an index file (index.pkl) that is automatically generated by the miner, or that you can construct manually (a sketch follows the examples below). Usage:
extract.py [-h] [-gpu] [-k K] [-c CENTROIDS] [-o {pickle,json,csv}]
[-s SAMPLE_FILES] [-m {vgg,alexnet}] [-v]
{boaw,bovw,cnn} data_dir out_file
Examples:
# Extract BoVW representations with k=100, sampling 10% for clustering, and store as a Python pickle.
python extract.py -k 100 -s 0.1 bovw ./img_data_dir ./output_vectors.pkl
# Extract CNN representations, using an AlexNet on a GPU, and store as a JSON file.
python extract.py -gpu -o json cnn ./img_data_dir ./output_vectors.json
# Extract BoAW representations with k=300, sampling 50% for clustering, and store as a CSV file.
python extract.py -k 300 -s 0.5 -o csv boaw ./sound_data_dir ./output_vectors.csv
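If you build a data directory by hand rather than with the miner, you also need to create index.pkl yourself. The sketch below assumes the index is a pickled dictionary mapping each query term to the list of files belonging to it; check an index produced by the miner to confirm the exact structure before relying on this:
# Hypothetical sketch: build an index.pkl mapping query terms to file lists.
# The assumed layout (query term -> list of file paths) should be verified
# against an index generated by the miner.
import os
import pickle

data_dir = './img_data_dir'
index = {
    'elephant': [os.path.join(data_dir, 'elephant_0001.jpg'),
                 os.path.join(data_dir, 'elephant_0002.jpg')],
    'happiness': [os.path.join(data_dir, 'happiness_0001.jpg')],
}
with open(os.path.join(data_dir, 'index.pkl'), 'wb') as f:
    pickle.dump(index, f)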
To extract layers from the CNN you need to tell the toolkit where it can find Caffe. For example (run this, or simply add it to your ~/.bashrc):
export CAFFE_ROOT_PATH="/usr/local/caffe/"
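To confirm that the variable is actually visible to the Python process running the toolkit, a trivial check:
# Print the Caffe root path the toolkit will see, if any.
import os
caffe_root = os.environ.get('CAFFE_ROOT_PATH')
print('CAFFE_ROOT_PATH =', caffe_root if caffe_root else 'not set')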
The demo downloads images from either Google or Bing and creates BoVW or CNN representations. It then evaluates similarity and relatedness (i.e., Spearman correlation with human similarity ratings) on the well-known MEN and SimLex-999 datasets. See e.g. Learning Image Embeddings using Convolutional Neural Networks for Improved Multi-Modal Semantics.
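The core of such an evaluation is comparing cosine similarities of the extracted vectors against human ratings with a Spearman correlation. A minimal, self-contained sketch with toy stand-in vectors and ratings (not the demo's actual data or variable names):
# Correlate model cosine similarities with human similarity ratings.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy stand-ins: word vectors and human ratings as (word1, word2, score) triples.
vectors = {
    'elephant': np.array([0.9, 0.1, 0.0]),
    'rhino': np.array([0.8, 0.2, 0.1]),
    'happiness': np.array([0.1, 0.9, 0.3]),
}
ratings = [('elephant', 'rhino', 8.0),
           ('elephant', 'happiness', 1.5),
           ('rhino', 'happiness', 1.0)]

model_scores = [cosine(vectors[w1], vectors[w2]) for w1, w2, _ in ratings]
human_scores = [s for _, _, s in ratings]
rho, p = spearmanr(model_scores, human_scores)
print('Spearman rho: %.3f' % rho)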
The demo downloads the ESP Game dataset sample and extracts it. It then builds an index from the label lookup and obtains BoVW or CNN representations for the thumbnail images. The representations are stored in a file for later use.
A simple demo to show that you can get local descriptors from Matlab and load them. This means you can use VLFeat or other libraries for getting descriptors (for instance, PHOW) as well.
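Reading descriptors saved from Matlab can be done with scipy; a minimal sketch, where the file name and the 'descriptors' variable name are assumptions rather than what the demo expects:
# Load a descriptor matrix saved from Matlab (e.g. produced with VLFeat's vl_phow).
from scipy.io import loadmat

mat = loadmat('descriptors.mat')        # path is a placeholder
descriptors = mat['descriptors']        # variable name is an assumption
print(descriptors.shape)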
The demo downloads sound files for eight instruments from two classes and obtains auditory representations. It then clusters the representations and reports the outcomes. See Multi- and Cross-Modal Semantics Beyond Vision: Grounding in Auditory Perception.
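One way to cluster such auditory vectors is scikit-learn's KMeans, shown here on random stand-in data (the demo's own clustering setup may differ):
# Cluster one auditory representation per sound file into two groups.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(8, 300)  # stand-in: 8 sound files, 300-dimensional BoAW vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # cluster assignment for each of the 8 files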
The demo downloads images for "elephant" and "happiness" and calculates the image dispersion scores of these concepts. See Improving Multi-Modal Representations Using Image Dispersion: Why Less is Sometimes More.
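Image dispersion is the average pairwise cosine distance between the image representations of a concept; a minimal sketch of that computation on random stand-in vectors:
# Compute the image dispersion score for one concept.
import numpy as np
from itertools import combinations

def image_dispersion(vectors):
    # Average pairwise cosine distance between a concept's image vectors.
    pairs = list(combinations(vectors, 2))
    dists = [1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
             for u, v in pairs]
    return sum(dists) / len(pairs)

# Stand-in for the image vectors extracted for one concept.
elephant_vectors = [np.random.rand(4096) for _ in range(10)]
print(image_dispersion(elephant_vectors))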
A simple plotting demo of images returned by various search engines. Requires matplotlib.
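A bare-bones version of such a plot, with placeholder image paths standing in for the mined files:
# Show a row of downloaded images side by side.
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

paths = ['./img_data_dir/elephant_0001.jpg',   # placeholder paths; in practice
         './img_data_dir/elephant_0002.jpg']   # these come from the data directory
fig, axes = plt.subplots(1, len(paths))
for ax, path in zip(axes, paths):
    ax.imshow(mpimg.imread(path))
    ax.axis('off')
plt.show()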