Descriptor Vector Exchange

This repo provides code for learning dense landmarks without supervision. Our approach is described in the ICCV 2019 paper "Unsupervised learning of landmarks by exchanging descriptor vectors".

High level Overview: The goal of this work is to learn a dense embedding Φ_u(x) ∈ R^C of image pixels without annotation. Our starting point was the Dense Equivariant Labelling approach of [3] (references follow at the end of the README), which tackles the same problem but is restricted to learning low-dimensional embeddings in order to achieve the key objective of generalisation across different identities. The key focus of Descriptor Vector Exchange (DVE) is to address this dimensionality issue, enabling the learning of more powerful, higher dimensional embeddings while still preserving their generalisation ability. To do so, we take inspiration from methods which enforce transitive/cyclic consistency constraints [4, 5, 6].

The embedding is learned from pairs of images (x,x′) related by a known warp v = g(u). In the image above, on the left we show the approach used by [3], which directly matches embedding Φ_u(x) from the left image to embeddings Φ_v(x′) in the right image to generate a loss. On the right, DVE replaces Φ_u(x) with its reconstruction Φˆ_u(x|xα) obtained from the embeddings in a third auxiliary image xα (the correspondence with xα does not need to be known). This mechanism encourages the embeddings to act consistently across different instances, even when the dimensionality is increased (see the paper for more details).
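The exchange step can be illustrated with a minimal sketch. It assumes the reconstruction Φˆ_u(x|xα) is a softmax-weighted average of the auxiliary image's descriptors, with weights given by inner-product similarity to Φ_u(x); the paper gives the exact formulation. The function names, the `temperature` parameter and the toy values are illustrative only and are not taken from this codebase.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def exchange_descriptor(query, aux_descriptors, temperature=1.0):
    """Reconstruct `query` as a convex combination of auxiliary descriptors.

    query: descriptor at one pixel u of the source image (length C).
    aux_descriptors: descriptors at every pixel of the auxiliary image.
    temperature: hypothetical sharpness parameter (an assumption of this sketch).
    """
    weights = softmax([dot(query, d) for d in aux_descriptors], temperature)
    dim = len(query)
    return [
        sum(w * d[c] for w, d in zip(weights, aux_descriptors))
        for c in range(dim)
    ]


# Toy example with C = 2 and a three-pixel auxiliary image (made-up values).
query = [1.0, 0.0]
aux = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
reconstruction = exchange_descriptor(query, aux, temperature=0.1)
print(reconstruction)
```

In this sketch the reconstructed descriptor, rather than the original one, would then be matched against the embeddings of the warped image x′, so a low loss requires the descriptor to survive the round trip through a different instance.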

Requirements: The code assumes PyTorch 1.1 and Python 3.6/3.7 (other versions may work, but have not been tested). See the section on dependencies towards the end of this file for specific package requirements.

Learned Embeddings

We provide pretrained models for each dataset to reproduce the results reported in the paper [1]. The training is performed with CelebA, a dataset of over 200k faces of celebrities that was originally described in this paper. We use this dataset to train our embedding function without annotations.

Each model is accompanied by training and evaluation logs and its mean pixel error performance on the task of matching annotated landmarks across the MAFL test set (described in more detail below). We use two architectures: the smallnet model of [3] and the more powerful hourglass model, inspired by its effectiveness in [7].

The goal of these initial experiments is to demonstrate that DVE allows models to generalise across identities even when using higher dimensional embeddings (e.g. 64d rather than 3d). By contrast, this does not occur when DVE is removed (see the ablation section below).

Embed. Dim	Model	Same Identity	Different Identity	Params	Links
3	smallnet	1.36	3.03	334.9k	config, model, log
16	smallnet	1.28	2.79	338.2k	config, model, log
32	smallnet	1.29	2.79	342.3k	config, model, log
64	smallnet	1.28	2.77	350.6k	config, model, log
64	hourglass	0.93	2.37	12.6M	config, model, log

Notes: The error metrics for the hourglass model, which are included for completeness, are approximately (but not exactly) comparable to those for smallnet, due to very slight differences in the cropping ratios used by the two architectures (0.3 for smallnet, 0.294 for hourglass).
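To make the matching metric in the table concrete, here is a minimal sketch of a mean pixel error computation. It assumes each annotated landmark in one image is matched to the pixel in the other image whose descriptor is nearest in squared Euclidean distance; the evaluation in test_matching.py may differ in its details. All names and toy values are illustrative.

```python
import math


def nearest_pixel(query, embedding):
    """Return (row, col) of the descriptor in `embedding` closest to `query`.

    embedding: nested list indexed as embedding[row][col] -> descriptor.
    """
    best, best_dist = None, math.inf
    for r, row in enumerate(embedding):
        for c, desc in enumerate(row):
            dist = sum((q - d) ** 2 for q, d in zip(query, desc))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def matching_error(emb_a, emb_b, landmarks_a, landmarks_b):
    """Mean pixel distance between matched and annotated landmarks in image B."""
    errors = []
    for (ra, ca), (rb, cb) in zip(landmarks_a, landmarks_b):
        pr, pc = nearest_pixel(emb_a[ra][ca], emb_b)
        errors.append(math.hypot(pr - rb, pc - cb))
    return sum(errors) / len(errors)


# Toy 2x2 embeddings with C = 2 (made-up values); image B is image A
# flipped left-right, so every landmark should be matched exactly.
emb_a = [[[0.0, 0.0], [1.0, 0.0]],
         [[0.0, 1.0], [1.0, 1.0]]]
emb_b = [[row[1], row[0]] for row in emb_a]
landmarks_a = [(0, 0), (1, 1)]
landmarks_b = [(0, 1), (1, 0)]
print(matching_error(emb_a, emb_b, landmarks_a, landmarks_b))
```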

Landmark Regression

Protocol Description: To transform the learned dense embeddings into landmark predictions, we use the same approach as [3]. For each target dataset, we freeze the dense embeddings and learn to peg onto them a collection of 50 "virtual" keypoints via a spatial softmax. These virtual keypoints are then used to regress the target keypoints of the dataset. We report the error as a percentage of inter-ocular distance (a metric defined by the landmarks of each dataset).
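The error metric can be sketched as follows. This is a minimal illustration, assuming the inter-ocular distance is the Euclidean distance between two designated eye landmarks in the ground truth; which landmarks define it depends on the dataset, so the indices used here are hypothetical.

```python
import math


def iod_normalised_error(predicted, target, left_eye_idx, right_eye_idx):
    """Mean landmark error as a percentage of the inter-ocular distance.

    predicted, target: lists of (x, y) landmark coordinates in pixels.
    left_eye_idx, right_eye_idx: indices of the landmarks that define the
    inter-ocular distance (dataset specific; assumed for this sketch).
    """
    lx, ly = target[left_eye_idx]
    rx, ry = target[right_eye_idx]
    iod = math.hypot(rx - lx, ry - ly)
    distances = [
        math.hypot(px - tx, py - ty)
        for (px, py), (tx, ty) in zip(predicted, target)
    ]
    return 100.0 * (sum(distances) / len(distances)) / iod


# Made-up five-landmark example: every prediction is off by 2 pixels in x
# and the eyes are 40 pixels apart.
target = [(30.0, 40.0), (70.0, 40.0), (50.0, 60.0), (35.0, 80.0), (65.0, 80.0)]
predicted = [(x + 2.0, y) for x, y in target]
error = iod_normalised_error(predicted, target, left_eye_idx=0, right_eye_idx=1)
print(error)
```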

MAFL landmark regression

MAFL is a dataset of 20k faces which includes landmark annotations. The dataset is partitioned into 19k training images and 1k testing images.

Embed. Dim	Model	Error (%IOD)	Links
3	smallnet	4.17	config, model, log
16	smallnet	3.97	config, model, log
32	smallnet	3.82	config, model, log
64	smallnet	3.42	config, model, log
64	hourglass	2.86	config, model, log

300-W landmark regression

The 300-W dataset contains 3,148 training images and 689 testing images with 68 facial landmark annotations for each face (with the split introduced in this CVPR 2014 paper). The dataset is described in this 2013 ICCV workshop paper.

Embed. Dim	Model	Error (%IOD)	Links
3	smallnet	7.66	config, model, log
16	smallnet	6.29	config, model, log
32	smallnet	6.13	config, model, log
64	smallnet	5.75	config, model, log
64	hourglass	4.65	config, model, log

AFLW landmark regression

The original AFLW contains around 25k images with up to 21 landmarks. For the purposes of evaluating five-landmark detectors, the authors of TCDCN introduced a test subset of almost 3K faces (for convenience, we include a mirror version of these images, but you can obtain the originals here).

There are two slightly different partitions of AFLW that have been used in prior work (we report numbers on both to allow for comparison). One is a set of recropped faces released by [7] (2991 test faces with 132 duplicates, 10122 train faces), which we call AFLW_R. The second is the train/test partition of AFLW used in [2, 3], which used the existing crops from MTFL (2995 faces) for testing and 10122 AFLW faces for training; we call this split AFLW_M.

Additionally, in the tables immediately below, each embedding is further fine-tuned on the AFLW_R/AFLW_M training sets (without annotations), as was done in [2], [3], [7], [8]. The rationale for this is that (i) it does not require any additional supervision; (ii) it allows the model to adjust for the differences in the face crops provided by the detector. To give an idea of how sensitive the method is to this step, we also report performance without finetuning in the ablation studies below.

AFLW_R landmark regression

Embed. Dim	Model	Error (%IOD)	Links
3	smallnet	10.13	config, model, log
16	smallnet	8.40	config, model, log
32	smallnet	8.18	config, model, log
64	smallnet	7.79	config, model, log
64	hourglass	6.54	config, model, log

AFLW_M landmark regression

AFLW_M is a dataset of faces which also includes landmark annotations. We use the P = 5 landmark test split (10,122 training images and 2,991 test images). The dataset can be obtained here and is described in this 2011 ICCV workshop paper.

Embed. Dim	Model	Error (%IOD)	Links
3	smallnet	11.12	config, model, log
16	smallnet	9.15	config, model, log
32	smallnet	9.17	config, model, log
64	smallnet	8.60	config, model, log
64	hourglass	7.53	config, model, log

Ablation Studies

We can study the effect of the DVE method by removing it during training and assessing the resulting embeddings for landmark regression. The ablations are performed on the lighter SmallNet model.

Embed. Dim	Model	DVE	Same Identity	Different Identity	Links
3	smallnet	✖️ / ✔️	1.33 / 1.36	2.89 / 3.03	(config, model, log) / (config, model, log)
16	smallnet	✖️ / ✔️	1.25 / 1.28	5.65 / 2.79	(config, model, log) / (config, model, log)
32	smallnet	✖️ / ✔️	1.26 / 1.29	5.81 / 2.79	(config, model, log) / (config, model, log)
64	smallnet	✖️ / ✔️	1.25 / 1.28	5.68 / 2.77	(config, model, log) / (config, model, log)

We see that without DVE, the learned embedding performs reasonably when the dimensionality is restricted to 3d. However, when we seek to learn higher dimensionality embeddings without DVE, they lose their ability to match across different identities. This inability to generalise at higher dimensions is similarly reflected when the embeddings are used to regress landmarks:

DVE Ablation: MAFL landmark regression

Embed. Dim	Model	DVE	Error (%IOD)	Links
3	smallnet	✖️ / ✔️	4.02/4.17	(config, model, log) / (config, model, log)
16	smallnet	✖️ / ✔️	5.31/3.97	(config, model, log) / (config, model, log)
32	smallnet	✖️ / ✔️	5.36/3.82	(config, model, log) / (config, model, log)
64	smallnet	✖️ / ✔️	4.99/3.42	(config, model, log) / (config, model, log)

DVE Ablation: 300w landmark regression

Embed. Dim	Model	DVE	Error (%IOD)	Links
3	smallnet	✖️ / ✔️	8.23/7.66	(config, model, log) / (config, model, log)
16	smallnet	✖️ / ✔️	10.66/6.29	(config, model, log) / (config, model, log)
32	smallnet	✖️ / ✔️	10.33/6.13	(config, model, log) / (config, model, log)
64	smallnet	✖️ / ✔️	9.33/5.75	(config, model, log) / (config, model, log)

DVE Ablation: AFLW_M landmark regression

Embed. Dim	Model	DVE	Error (%IOD)	Links
3	smallnet	✖️ / ✔️	10.99/11.12	(config, model, log) / (config, model, log)
16	smallnet	✖️ / ✔️	12.22/9.15	(config, model, log) / (config, model, log)
32	smallnet	✖️ / ✔️	12.60/9.17	(config, model, log) / (config, model, log)
64	smallnet	✖️ / ✔️	12.92/8.60	(config, model, log) / (config, model, log)

DVE Ablation: AFLW_R landmark regression

Embed. Dim	Model	DVE	Error (%IOD)	Links
3	smallnet	✖️ / ✔️	10.14/10.13	(config, model, log) / (config, model, log)
16	smallnet	✖️ / ✔️	10.73/8.40	(config, model, log) / (config, model, log)
32	smallnet	✖️ / ✔️	11.05/8.18	(config, model, log) / (config, model, log)
64	smallnet	✖️ / ✔️	11.43/7.79	(config, model, log) / (config, model, log)

Next we investigate how sensitive our approach is to finetuning on the target dataset (this is done for the AFLW_R and AFLW_M landmark regressions). We do two sets of experiments. First, we remove the finetuning for both the AFLW dataset variants and re-evaluate on the landmark regression tasks. Second, we add in a finetuning step for a different dataset, 300w, to see how the method is affected on a different benchmark. Note that all models for these experiments use DVE, and the finetuning consists of training the embeddings for an additional 50 epochs without annotations. We see that for the AFLW datasets, finetuning makes a reasonable difference to performance. However, for 300w, particularly for stronger models, it adds little benefit (for this reason we do not use finetuning on 300w for the results reported in the paper).

Finetuning Ablation: AFLW_M landmark regression

Embed. Dim	Model	Finetune	Error (%IOD)	Links
3	smallnet	✖️ / ✔️	11.82/11.12	(config, model, log) / (config, model, log)
16	smallnet	✖️ / ✔️	10.22/9.15	(config, model, log) / (config, model, log)
32	smallnet	✖️ / ✔️	9.80/9.17	(config, model, log) / (config, model, log)
64	smallnet	✖️ / ✔️	9.28/8.60	(config, model, log) / (config, model, log)
64	hourglass	✖️ / ✔️	8.15/7.53	(config, model, log) / (config, model, log)

Finetuning Ablation: AFLW_R landmark regression

Embed. Dim	Model	Finetune	Error (%IOD)	Links
3	smallnet	✖️ / ✔️	9.65/10.13	(config, model, log) / (config, model, log)
16	smallnet	✖️ / ✔️	8.91/8.40	(config, model, log) / (config, model, log)
32	smallnet	✖️ / ✔️	8.73/8.18	(config, model, log) / (config, model, log)
64	smallnet	✖️ / ✔️	8.14/7.79	(config, model, log) / (config, model, log)
64	hourglass	✖️ / ✔️	6.88/6.54	(config, model, log) / (config, model, log)

Finetuning Ablation: 300w landmark regression

Embed. Dim	Model	Finetune	Error (%IOD)	Links
3	smallnet	✖️ / ✔️	7.66/7.20	(config, model, log) / (config, model, log)
16	smallnet	✖️ / ✔️	6.29/5.90	(config, model, log) / (config, model, log)
32	smallnet	✖️ / ✔️	6.13/5.75	(config, model, log) / (config, model, log)
64	smallnet	✖️ / ✔️	5.75/5.58	(config, model, log) / (config, model, log)
64	hourglass	✖️ / ✔️	4.65/4.65	(config, model, log) / (config, model, log)

To enable the finetuning experiments to be reproduced, the training logs for each of the three datasets are provided below, together with their performance on the matching task.

Finetuning on AFLW_M

Embed. Dim	Model	Same Identity	Different Identity	Links
3	smallnet	5.99	7.16	config, model, log
16	smallnet	4.72	7.11	config, model, log
32	smallnet	6.42	8.71	config, model, log
64	smallnet	8.07	10.09	config, model, log
64	hourglass	1.53	3.65	config, model, log

Finetuning on AFLW_R

Embed. Dim	Model	Same Identity	Different Identity	Links
3	smallnet	6.36	7.69	config, model, log
16	smallnet	6.34	8.62	config, model, log
32	smallnet	8.10	10.11	config, model, log
64	smallnet	4.08	5.21	config, model, log
64	hourglass	1.17	4.04	config, model, log

Finetuning on 300w

Embed. Dim	Model	Same Identity	Different Identity	Links
3	smallnet	5.21	6.51	config, model, log
16	smallnet	5.55	7.30	config, model, log
32	smallnet	5.85	7.47	config, model, log
64	smallnet	6.58	8.19	config, model, log
64	hourglass	1.63	3.82	config, model, log

Annotation Ablation: AFLW_M landmark regression with limited labels

We perform a final ablation to investigate how well the regressors are able to perform when their access to annotation is further reduced, and they are simply provided with a few images. The results, shown below, are reported as mean/std over three runs (because when there is only a single annotation, the performance is quite sensitive to which particular annotation is selected). Particularly for the stronger models, reasonable performance can be obtained with a small number of annotated images.

Embed. Dim	Model	Num annos.	DVE	Error (%IOD)	Links
3	smallnet	1	✖️	19.87 (+/- 3.10)	config, model, log
3	smallnet	5	✖️	16.90 (+/- 1.04)	config, model, log
3	smallnet	10	✖️	16.12 (+/- 1.07)	config, model, log
3	smallnet	20	✖️	15.30 (+/- 0.59)	config, model, log
64	smallnet	1	✔️	17.13 (+/- 1.78)	config, model, log
64	smallnet	5	✔️	13.57 (+/- 2.08)	config, model, log
64	smallnet	10	✔️	12.97 (+/- 2.36)	config, model, log
64	smallnet	20	✔️	11.26 (+/- 0.93)	config, model, log
64	hourglass	1	✔️	14.23 (+/- 1.54)	config, model, log
64	hourglass	5	✔️	12.04 (+/- 2.03)	config, model, log
64	hourglass	10	✔️	12.25 (+/- 2.42)	config, model, log
64	hourglass	20	✔️	11.46 (+/- 0.83)	config, model, log
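For reference, a mean/std summary over repeated runs can be computed as in the sketch below. It uses the population standard deviation; whether the table above uses the population or the sample estimate is not stated, so treat that choice as an assumption. The run values are made up for illustration and are not results from this project.

```python
import statistics


def summarise_runs(errors):
    """Return (mean, std) of per-run errors, using the population std."""
    return statistics.mean(errors), statistics.pstdev(errors)


def format_summary(errors):
    """Format a summary in the same style as the table, e.g. '12.00 (+/- 1.63)'."""
    mean, std = summarise_runs(errors)
    return "{:.2f} (+/- {:.2f})".format(mean, std)


# Hypothetical errors from three runs with different annotated images.
run_errors = [10.0, 12.0, 14.0]
print(format_summary(run_errors))
```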

Dataset mirrors

For each dataset used in the paper, we provide a preprocessed copy to allow the results described above to be reproduced directly. These can be downloaded and unpacked with a utility script, which will store them in the locations expected by the training code. Each dataset has a brief README, which also provides the citations for use with each dataset, together with a link from which it can be downloaded directly.

Dataset	Details and links	Archive size	sha1sum
CelebA (+ MAFL)	README	9.0 GiB	`f6872ab0f2df8e5843abe99dc6d6100dd4fea29f`
300w	README	3.0 GiB	`885b09159c61fa29998437747d589c65cfc4ccd3`
AFLW_M	README	252 MiB	`1ff31c07cef4f2777b416d896a65f6c17d8ae2ee`
AFLW_R	README	1.1 GiB	`939fdce0e6262a14159832c71d4f84a9d516de5e`
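If you download an archive manually, its checksum can be compared against the table above. The following is a minimal sketch using only the Python standard library; the archive path in the usage comment is hypothetical, since the actual file names are determined by the download links.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected_sha1):
    """Return True if the file at `path` matches the published checksum."""
    return sha1_of_file(path) == expected_sha1.strip().lower()


# Example usage (hypothetical path, checksum taken from the AFLW_M row):
# verify_archive("data/aflw-mtfl.tar.gz",
#                "1ff31c07cef4f2777b416d896a65f6c17d8ae2ee")
```

Reading in chunks keeps memory use flat, which matters for the larger archives.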

Additional Notes

In the codebase AFLW<sub>R</sub> is simply referred to as AFLW, while AFLW<sub>M</sub> is referred to as AFLW-MTFL. For 300w, we compute the inter-ocular distance according to the definition given by the dataset organizers here. Some of the logs are generated from existing logfiles that were created with a slightly older version of the codebase (these differences only affect the log format, rather than the training code itself - the log generator can be found here.)

Evaluating a pretrained embedding

Evaluating a pretrained model for a given dataset requires:

The target dataset, which should be located in <root>/data/<dataset-name> (this will be done automatically by the data fetching script, or can be done manually).
A config.json file.
A checkpoint.pth file.

Evaluation is then performed with the following command:

python3 test_matching.py --config <path-to-config.json> --resume <path-to-trained_model.pth> --device <gpu-id>

where <gpu-id> is the index of the GPU to evaluate on. This option can be omitted to run the evaluation on the CPU.
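When scripting many evaluations, it can be convenient to assemble this command programmatically. The helper below is a hypothetical convenience wrapper, not part of the codebase; it uses only the three flags shown above and leaves out --device when no GPU index is given.

```python
import subprocess


def build_eval_command(config_path, checkpoint_path, device=None):
    """Build the argument list for test_matching.py.

    device: GPU index, or None to omit --device and run on the CPU.
    """
    cmd = [
        "python3", "test_matching.py",
        "--config", config_path,
        "--resume", checkpoint_path,
    ]
    if device is not None:
        cmd += ["--device", str(device)]
    return cmd


def run_eval(config_path, checkpoint_path, device=None):
    """Run the evaluation from the repository root, raising on failure."""
    subprocess.run(build_eval_command(config_path, checkpoint_path, device),
                   check=True)


print(" ".join(build_eval_command("config.json", "checkpoint.pth", device=0)))
```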

For example, to reproduce the smallnet-32d-dve results described above, run the following sequence of commands:

# fetch the mafl dataset (contained with celeba) 
python misc/sync_datasets.py --dataset celeba

# find the name of a pretrained model using the links in the tables above 
export MODEL=data/models/celeba-smallnet-32d-dve/2019-08-02_06-19-59/checkpoint-epoch100.pth

# create a local directory and download the model into it 
mkdir -p $(dirname "${MODEL}")
wget --output-document="${MODEL}" "http://www.robots.ox.ac.uk/~vgg/research/DVE/${MODEL}"

# Evaluate the model
python3 test_matching.py --config configs/celeba/smallnet-32d-dve.json --resume ${MODEL} --device 0
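The download steps above can also be expressed in Python, which may be handy on systems without wget. This is a sketch rather than a script shipped with the repo: the function names are hypothetical, and it relies on the convention visible in the commands above, where the URL is the base address followed by the local model path. Unlike the shell version, it skips the download if the file already exists.

```python
import os
import urllib.request

# Base address taken from the wget command above.
BASE_URL = "http://www.robots.ox.ac.uk/~vgg/research/DVE/"


def model_url(model_path):
    """Map a local model path to its download URL."""
    return BASE_URL + model_path


def fetch_model(model_path):
    """Download a pretrained model to `model_path` unless it already exists."""
    if os.path.exists(model_path):
        return model_path
    os.makedirs(os.path.dirname(model_path), exist_ok=True)
    urllib.request.urlretrieve(model_url(model_path), model_path)
    return model_path


MODEL = ("data/models/celeba-smallnet-32d-dve/"
         "2019-08-02_06-19-59/checkpoint-epoch100.pth")
print(model_url(MODEL))
# fetch_model(MODEL)  # uncomment to perform the download
```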

Regressing landmarks

Learning a landmark regressor for a given pretrained embedding requires:

The target dataset, which should be located in <root>/data/<dataset-name> (this will be done automatically by the data fetching script, or can be done manually).
A config.json file.
A checkpoint.pth file.

See the regressor code for details of how the regressor is implemented (it consists of a conv, then a spatial softmax, then a group conv).
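The spatial softmax step can be sketched in plain Python as follows. It assumes the common "soft-argmax" form, in which each heatmap produced by the first conv is normalised with a softmax over all pixels and reduced to its expected (x, y) location. The [-1, 1] coordinate convention is an assumption of this sketch, and the regressor code remains the reference for what is actually implemented.

```python
import math


def spatial_softmax_keypoint(heatmap):
    """Expected (x, y) location under a softmax over all pixels of a heatmap.

    heatmap: nested list indexed as heatmap[row][col] of unnormalised scores,
    with at least two rows and two columns.
    Coordinates lie in [-1, 1], with (-1, -1) at the top-left pixel.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    peak = max(max(row) for row in heatmap)
    weights = [[math.exp(v - peak) for v in row] for row in heatmap]
    total = sum(sum(row) for row in weights)
    x = y = 0.0
    for r in range(rows):
        for c in range(cols):
            p = weights[r][c] / total
            x += p * (2.0 * c / (cols - 1) - 1.0)
            y += p * (2.0 * r / (rows - 1) - 1.0)
    return x, y


# Made-up heatmap with a strong peak in the top-right corner.
heatmap = [[0.0, 0.0, 20.0],
           [0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0]]
print(spatial_softmax_keypoint(heatmap))
```

Because the output is an expectation rather than a hard argmax, it is differentiable with respect to the heatmap scores, which is what allows the virtual keypoints to be learned by gradient descent.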

Landmark learning is then performed with the following command:

python3 train.py --config <path-to-config.json> --resume <path-to-trained_model.pth> --device <gpu-id>

where <gpu-id> is the index of the GPU to evaluate on. This option can be omitted to run the evaluation on the CPU.

For example, to reproduce the smallnet-32d-dve landmark regression results described above, run the following sequence of commands:

# fetch the mafl dataset (contained with celeba) 
python misc/sync_datasets.py --dataset celeba

# find the name of a pretrained model using the links in the tables above 
export MODEL=data/models/celeba-smallnet-32d-dve/2019-08-08_17-56-24/checkpoint-epoch100.pth

# create a local directory and download the model into it 
mkdir -p $(dirname "${MODEL}")
wget --output-document="${MODEL}" "http://www.robots.ox.ac.uk/~vgg/research/DVE/${MODEL}"

# Evaluate the features by training a keypoint regressor 
python3 train.py --config configs/aflw-keypoints/celeba-smallnet-32d-dve.json --device 0

Learning new embeddings

Learning a new embedding requires:

The dataset used for training, which should be located in <root>/data/<dataset-name> (this will be done automatically by the data fetching script, or can be done manually).
A config.json file. You can define your own, or use one of the provided configs in the configs directory.

Training is then performed with the following command:

python3 train.py --config <path-to-config.json> --device <gpu-id>

where <gpu-id> is the index of the GPU to train on. This option can be omitted to run the training on the CPU.

For example, to train a 16d-dve embedding on celeba, run the following sequence of commands:

# fetch the celeba dataset 
python misc/sync_datasets.py --dataset celeba

# Train the model
python3 train.py --config configs/celeba/smallnet-16d-dve.json --device 0

Dependencies

If you have enough disk space, the recommended approach to installing the dependencies for this project is to create a conda environment via the requirements/conda-freeze.yml:

conda env create -f requirements/conda-freeze.yml

Otherwise, if you'd prefer to take a leaner approach, you can either:

pip/conda install each missing package each time you hit an ImportError
manually inspect the slightly more readable requirements/pip-requirements.txt
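For the leaner route, a quick check of which packages are missing can save a few rounds of ImportError. The sketch below uses only the standard library; the package names in the example are illustrative, and requirements/pip-requirements.txt remains the authoritative list.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Illustrative list only; replace with the modules the project imports.
print(missing_packages(["torch", "numpy"]))
```

Note that `find_spec` expects import names, which can differ from the names used to install a package.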

Citation

If you find this code useful, please consider citing:

@inproceedings{Thewlis2019a,
  author    = {Thewlis, J. and Albanie, S. and Bilen, H. and Vedaldi, A.},
  booktitle = {International Conference on Computer Vision},
  title     = {Unsupervised learning of landmarks by exchanging descriptor vectors},
  date      = {2019},
}

Related useful codebases

Some other codebases you might like to check out if you are interested in self-supervised learning of object structure.

LMDIS-REP [7]
IMM [8]
Fab-Net [9]

Acknowledgements

We would like to thank Almut Sophia Koepke for helpful discussions. The project structure uses the pytorch-template by @victoresque.

References

[1] James Thewlis, Samuel Albanie, Hakan Bilen, and Andrea Vedaldi. "Unsupervised learning of landmarks by exchanging descriptor vectors." ICCV 2019.

[2] James Thewlis, Hakan Bilen, and Andrea Vedaldi. "Unsupervised learning of object landmarks by factorized spatial embeddings." ICCV 2017.

[3] James Thewlis, Hakan Bilen, and Andrea Vedaldi. "Unsupervised learning of object frames by dense equivariant image labelling." NeurIPS 2017.

[4] Sundaram, N., Brox, T., & Keutzer, K. "Dense point trajectories by GPU-accelerated large displacement optical flow." ECCV 2010.

[5] C. Zach, M. Klopschitz, and M. Pollefeys. "Disambiguating visual relations using loop constraints." CVPR 2010.

[6] Zhou, T., Jae Lee, Y., Yu, S. X., & Efros, A. A. "Flowweb: Joint image set alignment by weaving consistent, pixel-wise correspondences." CVPR 2015.

[7] Zhang, Yuting, Yijie Guo, Yixin Jin, Yijun Luo, Zhiyuan He, and Honglak Lee. "Unsupervised discovery of object landmarks as structural representations." CVPR 2018.

[8] Jakab, T., Gupta, A., Bilen, H., & Vedaldi, A. "Unsupervised learning of object landmarks through conditional image generation." NeurIPS 2018.

[9] Olivia Wiles, A. Sophia Koepke, and Andrew Zisserman. "Self-supervised learning of a facial attribute embedding from video." BMVC 2018.
