Mainly changed the format of the generated npz files; everything else is largely untouched.
This repository contains a PyTorch reimplementation of the bottom-up-attention project based on Caffe.
We use Detectron2 as the backend to provide complete functionality, including training, testing and feature extraction. Furthermore, we migrate the pre-trained Caffe-based model from the original repository, which can extract the same visual features as the original model (with deviation < 0.01).
Some example object and attribute predictions for salient image regions are illustrated below. The script to obtain the following visualizations can be found here.
Note that most of the requirements above are needed for Detectron2.
- Install Detectron2 according to their official instructions here.
- Compile the other tools using the following script:
# clone the repository
$ git clone --recursive https://github.com/MILVLG/bottom-up-attention.pytorch
# install apex
$ git clone https://github.com/NVIDIA/apex.git
$ cd apex
$ python setup.py install
$ cd ..
# install the rest modules
$ python setup.py build develop
Note that using the latest version of Detectron2 may result in a runtime error. Please use the version recommended in this repository.
If you want to train or test the model, you need to download the images and annotation files of the Visual Genome (VG) dataset. If you only need to extract visual features using the pre-trained model, you can skip this part.
Download the original VG images (part1 and part2) and unzip them into the `datasets` folder.
The annotation files generated in the original repository need to be transformed into the COCO data format required by Detectron2. The preprocessed annotation files can be downloaded here and unzipped into the `datasets` folder.
Finally, the `datasets` folder will have the following structure:
|-- datasets
   |-- vg
   |  |-- image
   |  |  |-- VG_100K
   |  |  |  |-- 2.jpg
   |  |  |  |-- ...
   |  |  |-- VG_100K_2
   |  |  |  |-- 1.jpg
   |  |  |  |-- ...
   |  |-- annotations
   |  |  |-- train.json
   |  |  |-- val.json
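Before training or testing, it can help to confirm that the folder matches the tree above. The following is a minimal sketch, not part of the repository: the function name `missing_vg_paths` and the constant `EXPECTED_PATHS` are hypothetical, and the checked paths are taken directly from the directory tree.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED_PATHS = (
    "vg/image/VG_100K",
    "vg/image/VG_100K_2",
    "vg/annotations/train.json",
    "vg/annotations/val.json",
)


def missing_vg_paths(datasets_root="datasets"):
    """Return the expected paths that do not exist under datasets_root."""
    root = Path(datasets_root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_vg_paths()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Dataset layout looks complete.")
```

An empty list means all four entries were found. The check only tests for existence; it does not verify that the image folders are fully populated.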
The following script will train a bottom-up-attention model on the `train` split of VG. We are still working on this part to reproduce the same results as the Caffe version.
$ python3 train_net.py --mode detectron2 \
--config-file configs/bua-caffe/train-bua-caffe-r101.yaml \
--resume
- `mode = {'caffe', 'detectron2'}` refers to the mode used. Only the `detectron2` mode is supported for training, since we think it is unnecessary to train a new model using the `caffe` mode.
- `config-file` refers to all the configurations of the model.
- `resume` is a flag that resumes training from a specific checkpoint.
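If training is launched from another Python script, the command above can be assembled as an argument list. This is only a sketch: `build_train_command` is a hypothetical helper, not something the repository provides, and it uses only the flags shown in the command above.

```python
import shlex


def build_train_command(config_file, resume=True, python="python3"):
    """Assemble the train_net.py call shown above as an argument list.

    Training is only supported in the 'detectron2' mode, so the mode is fixed.
    """
    cmd = [python, "train_net.py", "--mode", "detectron2",
           "--config-file", config_file]
    if resume:
        cmd.append("--resume")
    return cmd


if __name__ == "__main__":
    cmd = build_train_command("configs/bua-caffe/train-bua-caffe-r101.yaml")
    # Inspect the command first; it can then be passed to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Keeping the command as a list avoids shell quoting problems when paths contain spaces.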
Given the trained model, the following script will test the performance on the `val` split of VG:
$ python3 train_net.py --mode caffe \
--config-file configs/bua-caffe/test-bua-caffe-r101.yaml \
--eval-only --resume
- `mode = {'caffe', 'detectron2'}` refers to the mode used. For the model converted from Caffe, use the `caffe` mode. For other models trained with Detectron2, use the `detectron2` mode.
- `config-file` refers to all the configurations of the model, which also include the path of the model weights.
- `eval-only` is a flag that declares the testing phase.
- `resume` is a flag that declares the use of the pre-trained model.
Similar to the testing stage, the following script extracts the bottom-up-attention visual features with the provided hyper-parameters:
$ python3 extract_features.py --mode caffe \
--config-file configs/bua-caffe/extract-bua-caffe-r101.yaml \
--image-dir <image_dir> --gt-bbox-dir <bbox_dir> --out-dir <out_dir> --resume
- `mode = {'caffe', 'detectron2'}` refers to the mode used. For the model converted from Caffe, use the `caffe` mode. For other models trained with Detectron2, use the `detectron2` mode.
- `config-file` refers to all the configurations of the model, which also include the path of the model weights.
- `image-dir` refers to the input image directory.
- `gt-bbox-dir` refers to the ground truth bbox directory.
- `out-dir` refers to the output feature directory.
- `resume` is a flag that declares the use of the pre-trained model.
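The note at the top of this README says the format of the generated npz files was changed, but the array names are not listed here. Assuming the extracted features are written to the output directory as `.npz` files, the sketch below lists the array names in each file without assuming what they are. It relies only on the fact that an `.npz` file is a zip archive holding one `<name>.npy` member per array; the helper names are hypothetical.

```python
import zipfile
from pathlib import Path


def list_npz_arrays(npz_path):
    """Return the array names stored in an .npz file, without loading the data.

    An .npz file is a zip archive in which each array is saved as a
    member called "<name>.npy", so the names can be read with zipfile.
    """
    with zipfile.ZipFile(npz_path) as archive:
        return sorted(
            name[: -len(".npy")]
            for name in archive.namelist()
            if name.endswith(".npy")
        )


def summarize_output_dir(out_dir):
    """Map each .npz file name in out_dir to the array names it contains."""
    return {
        path.name: list_npz_arrays(path)
        for path in sorted(Path(out_dir).glob("*.npz"))
    }
```

Once the names are known, the arrays themselves can be loaded with `numpy.load`.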
Moreover, using the same pre-trained model, we provide a two-stage strategy for extracting visual features, which results in (slightly) more accurate visual features:
# extract bboxes only:
$ python3 extract_features.py --mode caffe \
--config-file configs/bua-caffe/extract-bua-caffe-r101-bbox-only.yaml \
--image-dir <image_dir> --out-dir <out_dir> --resume
# extract visual features with the pre-extracted bboxes:
$ python3 extract_features.py --mode caffe \
--config-file configs/bua-caffe/extract-bua-caffe-r101-gt-bbox.yaml \
--image-dir <image_dir> --gt-bbox-dir <bbox_dir> --out-dir <out_dir> --resume
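The two stages above can be chained from Python. This is a sketch under one stated assumption: the directory that stage one writes to with `--out-dir` is the same directory that stage two reads with `--gt-bbox-dir`. The helper names `two_stage_commands` and `run_two_stage` are hypothetical; the flags and config paths are copied from the commands above.

```python
import shlex
import subprocess

BBOX_ONLY_CONFIG = "configs/bua-caffe/extract-bua-caffe-r101-bbox-only.yaml"
GT_BBOX_CONFIG = "configs/bua-caffe/extract-bua-caffe-r101-gt-bbox.yaml"


def two_stage_commands(image_dir, bbox_dir, feature_dir, mode="caffe"):
    """Return the two extract_features.py calls, in the order they must run."""
    base = ["python3", "extract_features.py", "--mode", mode]
    # Stage 1: extract bboxes only, written to bbox_dir.
    stage1 = base + ["--config-file", BBOX_ONLY_CONFIG,
                     "--image-dir", image_dir,
                     "--out-dir", bbox_dir, "--resume"]
    # Stage 2: extract features using the bboxes from stage 1
    # (assumption: stage 1's output directory is stage 2's --gt-bbox-dir).
    stage2 = base + ["--config-file", GT_BBOX_CONFIG,
                     "--image-dir", image_dir,
                     "--gt-bbox-dir", bbox_dir,
                     "--out-dir", feature_dir, "--resume"]
    return [stage1, stage2]


def run_two_stage(image_dir, bbox_dir, feature_dir, dry_run=True):
    """Print the commands, and execute them only when dry_run is False."""
    for cmd in two_stage_commands(image_dir, bbox_dir, feature_dir):
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop if a stage fails
```

With the default `dry_run=True` the commands are only printed, which makes it easy to check the paths before starting a long extraction run.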
We provide pre-trained models here. The evaluation metrics are exactly the same as those in the original Caffe project. More models will be added over time.
Model | Mode | Backbone | Objects mAP@0.5 | Objects weighted mAP@0.5 | Download |
---|---|---|---|---|---|
Faster R-CNN | Caffe, K=36 | ResNet-101 | 9.3% | 14.0% | model |
Faster R-CNN | Caffe, K=[10,100] | ResNet-101 | 10.2% | 15.1% | model |
Faster R-CNN | Caffe, K=[10,100] | ResNet-152 | 11.1% | 15.7% | model |
This project is released under the Apache 2.0 license.
This repo is currently maintained by Jing Li (@J1mL3e_) and Zhou Yu (@yuzcccc).