This repo provides code for learning dense landmarks without supervision. Our approach is described in the ICCV 2019 paper "Unsupervised learning of landmarks by exchanging descriptor vectors".
High level Overview: The goal of this work is to learn a dense embedding Φ<sub>u</sub>(x) ∈ R<sup>C</sup> of image pixels without annotation. Our starting point was the Dense Equivariant Labelling approach of [3] (references follow at the end of the README), which tackles the same problem but is restricted to learning low-dimensional embeddings in order to generalise across different identities. The key focus of Descriptor Vector Exchange (DVE) is to address this dimensionality issue, enabling the learning of more powerful, higher-dimensional embeddings while preserving their ability to generalise. To do so, we take inspiration from methods which enforce transitive/cyclic consistency constraints [4, 5, 6].
The embedding is learned from pairs of images (x, x′) related by a known warp v = g(u). In the image above, on the left we show the approach used by [3], which directly matches the embedding Φ<sub>u</sub>(x) from the left image to the embeddings Φ<sub>v</sub>(x′) in the right image to generate a loss. On the right, DVE replaces Φ<sub>u</sub>(x) with its reconstruction Φ̂<sub>u</sub>(x|x<sub>α</sub>), obtained from the embeddings in a third auxiliary image x<sub>α</sub> (the correspondence with x<sub>α</sub> does not need to be known). This mechanism encourages the embeddings to act consistently across different instances, even when the dimensionality is increased (see the paper for more details).
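The exchange step can be illustrated with a small pure-Python sketch. It assumes that the reconstruction is a softmax-weighted average of the auxiliary image's descriptors, with weights given by descriptor similarity; the function names and the `temperature` parameter are hypothetical and are not taken from this codebase (see the paper for the exact formulation).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length descriptor vectors."""
    return sum(x * y for x, y in zip(a, b))


def exchange_descriptor(query, aux_descriptors, temperature=1.0):
    """Reconstruct `query` from the descriptors of an auxiliary image.

    Assumption: the reconstruction is the softmax-weighted average of the
    auxiliary descriptors, weighted by their similarity to `query`.
    """
    weights = softmax([dot(query, d) / temperature for d in aux_descriptors])
    dim = len(query)
    return [
        sum(w * d[k] for w, d in zip(weights, aux_descriptors))
        for k in range(dim)
    ]


if __name__ == "__main__":
    # Toy 2-d descriptors for two pixels of the auxiliary image.
    aux = [[1.0, 0.0], [0.0, 1.0]]
    print(exchange_descriptor([1.0, 0.0], aux, temperature=0.01))
```

The reconstructed descriptor, rather than the original one, would then be matched against the descriptors of the warped image x′, so the loss can only be low if descriptors are consistent across the two different instances.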
Requirements: The code assumes PyTorch 1.1 and Python 3.6/3.7 (other versions may work, but have not been tested). See the section on dependencies towards the end of this file for specific package requirements.
We provide pretrained models for each dataset to reproduce the results reported in the paper [1]. The training is performed with CelebA, a dataset of over 200k faces of celebrities that was originally described in this paper. We use this dataset to train our embedding function without annotations.
Each model is accompanied by training and evaluation logs and its mean pixel error performance on the task of matching annotated landmarks across the MAFL test set (described in more detail below). We use two architectures: the smallnet model of [3] and the more powerful hourglass model, inspired by its effectiveness in [7].
The goal of these initial experiments is to demonstrate that DVE allows models to generalise across identities even when using higher dimensional embeddings (e.g. 64d rather than 3d). By contrast, this does not occur when DVE is removed (see the ablation section below).
Embed. Dim | Model | Same Identity | Different Identity | Params | Links |
---|---|---|---|---|---|
3 | smallnet | 1.36 | 3.03 | 334.9k | config, model, log |
16 | smallnet | 1.28 | 2.79 | 338.2k | config, model, log |
32 | smallnet | 1.29 | 2.79 | 342.3k | config, model, log |
64 | smallnet | 1.28 | 2.77 | 350.6k | config, model, log |
64 | hourglass | 0.93 | 2.37 | 12.6M | config, model, log |
Notes: The error metrics for the `hourglass` model, which are included for completeness, are approximately (but not exactly) comparable to the metrics for the `smallnet` due to very slight differences in the cropping ratios used by the two architectures (0.3 for smallnet, 0.294 for hourglass).
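To make the matching metric concrete, the following is a rough sketch of how a mean pixel error could be computed. It assumes nearest-neighbour matching under squared Euclidean distance between descriptors; the helper names are hypothetical and the actual logic in `test_matching.py` may differ.

```python
import math


def match_landmark(descriptor, descriptor_map):
    """Return the (row, col) of the closest descriptor in `descriptor_map`."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(descriptor_map):
        for c, candidate in enumerate(row):
            dist = sum((x - y) ** 2 for x, y in zip(descriptor, candidate))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def mean_pixel_error(map_a, map_b, landmarks_a, landmarks_b):
    """Average distance between matched and annotated landmarks in image B.

    `map_a` and `map_b` are H x W grids of descriptors; the landmark lists
    hold corresponding (row, col) annotations for the two images.
    """
    errors = []
    for (ra, ca), (rb, cb) in zip(landmarks_a, landmarks_b):
        pr, pc = match_landmark(map_a[ra][ca], map_b)
        errors.append(math.hypot(pr - rb, pc - cb))
    return sum(errors) / len(errors)
```

In the "Same Identity" setting the two images would be warped copies of one face, while in the "Different Identity" setting they would show different people.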
Protocol Description: To transform the learned dense embeddings into landmark predictions, we use the same approach as [3]. For each target dataset, we freeze the dense embeddings and learn to peg onto them a collection of 50 "virtual" keypoints via a spatial softmax. These virtual keypoints are then used to regress the target keypoints of the dataset. We report the error as a percentage of inter-ocular distance (a metric defined by the landmarks of each dataset).
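As an illustration of the metric, here is a minimal sketch of an error expressed as a percentage of inter-ocular distance. It assumes that the first two landmarks are the two eyes, which is a convention chosen for this example only; each dataset defines its own inter-ocular distance, and the function name is hypothetical.

```python
import math


def iod_normalised_error(pred, target, left_eye_idx=0, right_eye_idx=1):
    """Mean landmark error as a percentage of inter-ocular distance.

    `pred` and `target` are lists of (x, y) points in the same order.
    The eye indices are assumptions and should follow the dataset's own
    definition of inter-ocular distance.
    """
    lx, ly = target[left_eye_idx]
    rx, ry = target[right_eye_idx]
    iod = math.hypot(lx - rx, ly - ry)
    distances = [
        math.hypot(px - tx, py - ty)
        for (px, py), (tx, ty) in zip(pred, target)
    ]
    return 100.0 * sum(distances) / (len(distances) * iod)


if __name__ == "__main__":
    target = [(0.0, 0.0), (10.0, 0.0), (5.0, 5.0)]
    pred = [(0.0, 1.0), (10.0, 0.0), (5.0, 5.0)]
    print("{:.2f}".format(iod_normalised_error(pred, target)))
```

Normalising by the inter-ocular distance makes errors comparable across faces that appear at different scales.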
MAFL landmark regression
MAFL is a dataset of 20k faces which includes landmark annotations. The dataset is partitioned into 19k training images and 1k testing images.
Embed. Dim | Model | Error (%IOD) | Links |
---|---|---|---|
3 | smallnet | 4.17 | config, model, log |
16 | smallnet | 3.97 | config, model, log |
32 | smallnet | 3.82 | config, model, log |
64 | smallnet | 3.42 | config, model, log |
64 | hourglass | 2.86 | config, model, log |
300-W landmark regression
The 300-W dataset contains 3,148 training images and 689 testing images with 68 facial landmark annotations for each face (with the split introduced in this CVPR 2014 paper). The dataset is described in this 2013 ICCV workshop paper.
Embed. Dim | Model | Error (%IOD) | Links |
---|---|---|---|
3 | smallnet | 7.66 | config, model, log |
16 | smallnet | 6.29 | config, model, log |
32 | smallnet | 6.13 | config, model, log |
64 | smallnet | 5.75 | config, model, log |
64 | hourglass | 4.65 | config, model, log |
AFLW landmark regression
The original AFLW contains around 25k images with up to 21 landmarks. For the purposes of evaluating five-landmark detectors, the authors of TCDCN introduced a test subset of almost 3K faces (for convenience, we include a mirror version of these images, but you can obtain the originals here).
There are two slightly different partitions of AFLW that have been used in prior work (we report numbers on both to allow for comparison). One is a set of recropped faces released by [7] (2991 test faces with 132 duplicates, 10122 train faces) (here we call this AFLWR). The second is the train/test partition of AFLW used in the works of [2,3] which used the existing crops from MTFL (2995 faces) for testing and 10122 AFLW faces for training (we call this dataset split AFLWM).
Additionally, in the tables immediately below, each embedding is further fine-tuned on the AFLWR/AFLWM training sets (without annotations), as was done in [2], [3], [7], [8]. The rationale for this is that (i) it does not require any additional supervision; (ii) it allows the model to adjust for the differences in the face crops provided by the detector. To give an idea of how sensitive the method is to this step, we also report performance without finetuning in the ablation studies below.
AFLWR landmark regression
Embed. Dim | Model | Error (%IOD) | Links |
---|---|---|---|
3 | smallnet | 10.13 | config, model, log |
16 | smallnet | 8.40 | config, model, log |
32 | smallnet | 8.18 | config, model, log |
64 | smallnet | 7.79 | config, model, log |
64 | hourglass | 6.54 | config, model, log |
AFLWM landmark regression
AFLWM is a dataset of faces which also includes landmark annotations. We use the P = 5 landmark test split (10,122 training images and 2,991 test images). The dataset can be obtained here and is described in this 2011 ICCV workshop paper.
Embed. Dim | Model | Error (%IOD) | Links |
---|---|---|---|
3 | smallnet | 11.12 | config, model, log |
16 | smallnet | 9.15 | config, model, log |
32 | smallnet | 9.17 | config, model, log |
64 | smallnet | 8.60 | config, model, log |
64 | hourglass | 7.53 | config, model, log |
We can study the effect of the DVE method by removing it during training and assessing the resulting embeddings for landmark regression. The ablations are performed on the lighter SmallNet model.
Embed. Dim | Model | DVE | Same Identity | Different Identity | Links |
---|---|---|---|---|---|
3 | smallnet | ✖️ / ✔️ | 1.33 / 1.36 | 2.89 / 3.03 | (config, model, log) / (config, model, log) |
16 | smallnet | ✖️ / ✔️ | 1.25 / 1.28 | 5.65 / 2.79 | (config, model, log) / (config, model, log) |
32 | smallnet | ✖️ / ✔️ | 1.26 / 1.29 | 5.81 / 2.79 | (config, model, log) / (config, model, log) |
64 | smallnet | ✖️ / ✔️ | 1.25 / 1.28 | 5.68 / 2.77 | (config, model, log) / (config, model, log) |
We see that without DVE, the learned embedding performs reasonably when the dimensionality is restricted to 3d. However, when we seek to learn higher dimensionality embeddings without DVE, they lose their ability to match across different identities. This inability to generalise at higher dimensions is similarly reflected when the embeddings are used to regress landmarks:
DVE Ablation: MAFL landmark regression
Embed. Dim | Model | DVE | Error (%IOD) | Links |
---|---|---|---|---|
3 | smallnet | ✖️ / ✔️ | 4.02/4.17 | (config, model, log) / (config, model, log) |
16 | smallnet | ✖️ / ✔️ | 5.31/3.97 | (config, model, log) / (config, model, log) |
32 | smallnet | ✖️ / ✔️ | 5.36/3.82 | (config, model, log) / (config, model, log) |
64 | smallnet | ✖️ / ✔️ | 4.99/3.42 | (config, model, log) / (config, model, log) |
DVE Ablation: 300w landmark regression
Embed. Dim | Model | DVE | Error (%IOD) | Links |
---|---|---|---|---|
3 | smallnet | ✖️ / ✔️ | 8.23/7.66 | (config, model, log) / (config, model, log) |
16 | smallnet | ✖️ / ✔️ | 10.66/6.29 | (config, model, log) / (config, model, log) |
32 | smallnet | ✖️ / ✔️ | 10.33/6.13 | (config, model, log) / (config, model, log) |
64 | smallnet | ✖️ / ✔️ | 9.33/5.75 | (config, model, log) / (config, model, log) |
DVE Ablation: AFLWM landmark regression
Embed. Dim | Model | DVE | Error (%IOD) | Links |
---|---|---|---|---|
3 | smallnet | ✖️ / ✔️ | 10.99/11.12 | (config, model, log) / (config, model, log) |
16 | smallnet | ✖️ / ✔️ | 12.22/9.15 | (config, model, log) / (config, model, log) |
32 | smallnet | ✖️ / ✔️ | 12.60/9.17 | (config, model, log) / (config, model, log) |
64 | smallnet | ✖️ / ✔️ | 12.92/8.60 | (config, model, log) / (config, model, log) |
DVE Ablation: AFLWR landmark regression
Embed. Dim | Model | DVE | Error (%IOD) | Links |
---|---|---|---|---|
3 | smallnet | ✖️ / ✔️ | 10.14/10.13 | (config, model, log) / (config, model, log) |
16 | smallnet | ✖️ / ✔️ | 10.73/8.40 | (config, model, log) / (config, model, log) |
32 | smallnet | ✖️ / ✔️ | 11.05/8.18 | (config, model, log) / (config, model, log) |
64 | smallnet | ✖️ / ✔️ | 11.43/7.79 | (config, model, log) / (config, model, log) |
Next we investigate how sensitive our approach is to finetuning on the target dataset (this is done for the AFLWR and AFLWM landmark regressions). We do two sets of experiments. First, we remove the finetuning for both the AFLW dataset variants and re-evaluate on the landmark regression tasks. Second, we add in a finetuning step for a different dataset, 300w, to see how the method is affected on a different benchmark. Note that all models for these experiments use DVE, and the finetuning consists of training the embeddings for an additional 50 epochs without annotations. We see that for the AFLW datasets, finetuning makes a reasonable difference to performance. However, for 300w, particularly for stronger models, it adds little benefit (for this reason we do not use finetuning on 300w for the results reported in the paper).
Finetuning Ablation: AFLWM landmark regression
Embed. Dim | Model | Finetune | Error (%IOD) | Links |
---|---|---|---|---|
3 | smallnet | ✖️ / ✔️ | 11.82/11.12 | (config, model, log) / (config, model, log) |
16 | smallnet | ✖️ / ✔️ | 10.22/9.15 | (config, model, log) / (config, model, log) |
32 | smallnet | ✖️ / ✔️ | 9.80/9.17 | (config, model, log) / (config, model, log) |
64 | smallnet | ✖️ / ✔️ | 9.28/8.60 | (config, model, log) / (config, model, log) |
64 | hourglass | ✖️ / ✔️ | 8.15/7.53 | (config, model, log) / (config, model, log) |
Finetuning Ablation: AFLWR landmark regression
Embed. Dim | Model | Finetune | Error (%IOD) | Links |
---|---|---|---|---|
3 | smallnet | ✖️ / ✔️ | 9.65/10.13 | (config, model, log) / (config, model, log) |
16 | smallnet | ✖️ / ✔️ | 8.91/8.40 | (config, model, log) / (config, model, log) |
32 | smallnet | ✖️ / ✔️ | 8.73/8.18 | (config, model, log) / (config, model, log) |
64 | smallnet | ✖️ / ✔️ | 8.14/7.79 | (config, model, log) / (config, model, log) |
64 | hourglass | ✖️ / ✔️ | 6.88/6.54 | (config, model, log) / (config, model, log) |
Finetuning Ablation: 300w landmark regression
Embed. Dim | Model | Finetune | Error (%IOD) | Links |
---|---|---|---|---|
3 | smallnet | ✖️ / ✔️ | 7.66/7.20 | (config, model, log) / (config, model, log) |
16 | smallnet | ✖️ / ✔️ | 6.29/5.90 | (config, model, log) / (config, model, log) |
32 | smallnet | ✖️ / ✔️ | 6.13/5.75 | (config, model, log) / (config, model, log) |
64 | smallnet | ✖️ / ✔️ | 5.75/5.58 | (config, model, log) / (config, model, log) |
64 | hourglass | ✖️ / ✔️ | 4.65/4.65 | (config, model, log) / (config, model, log) |
To enable the finetuning experiments to be reproduced, the training logs for each of the three datasets are provided below, together with their performance on the matching task.
Finetuning on AFLWM
Embed. Dim | Model | Same Identity | Different Identity | Links |
---|---|---|---|---|
3 | smallnet | 5.99 | 7.16 | config, model, log |
16 | smallnet | 4.72 | 7.11 | config, model, log |
32 | smallnet | 6.42 | 8.71 | config, model, log |
64 | smallnet | 8.07 | 10.09 | config, model, log |
64 | hourglass | 1.53 | 3.65 | config, model, log |
Finetuning on AFLWR
Embed. Dim | Model | Same Identity | Different Identity | Links |
---|---|---|---|---|
3 | smallnet | 6.36 | 7.69 | config, model, log |
16 | smallnet | 6.34 | 8.62 | config, model, log |
32 | smallnet | 8.10 | 10.11 | config, model, log |
64 | smallnet | 4.08 | 5.21 | config, model, log |
64 | hourglass | 1.17 | 4.04 | config, model, log |
Finetuning on 300w
Embed. Dim | Model | Same Identity | Different Identity | Links |
---|---|---|---|---|
3 | smallnet | 5.21 | 6.51 | config, model, log |
16 | smallnet | 5.55 | 7.30 | config, model, log |
32 | smallnet | 5.85 | 7.47 | config, model, log |
64 | smallnet | 6.58 | 8.19 | config, model, log |
64 | hourglass | 1.63 | 3.82 | config, model, log |
Annotation Ablation: AFLWM landmark regression with limited labels
We perform a final ablation to investigate how well the regressors are able to perform when their access to annotation is further reduced, and they are simply provided with a few images. The results, shown below, are reported as mean/std over three runs (because when there is only a single annotation, the performance is quite sensitive to which particular annotation is selected). Particularly for the stronger models, reasonable performance can be obtained with a small number of annotated images.
Embed. Dim | Model | Num annos. | DVE | Error (%IOD) | Links |
---|---|---|---|---|---|
3 | smallnet | 1 | ✖️ | 19.87 (+/- 3.10) | config, model, log |
3 | smallnet | 5 | ✖️ | 16.90 (+/- 1.04) | config, model, log |
3 | smallnet | 10 | ✖️ | 16.12 (+/- 1.07) | config, model, log |
3 | smallnet | 20 | ✖️ | 15.30 (+/- 0.59) | config, model, log |
64 | smallnet | 1 | ✔️ | 17.13 (+/- 1.78) | config, model, log |
64 | smallnet | 5 | ✔️ | 13.57 (+/- 2.08) | config, model, log |
64 | smallnet | 10 | ✔️ | 12.97 (+/- 2.36) | config, model, log |
64 | smallnet | 20 | ✔️ | 11.26 (+/- 0.93) | config, model, log |
64 | hourglass | 1 | ✔️ | 14.23 (+/- 1.54) | config, model, log |
64 | hourglass | 5 | ✔️ | 12.04 (+/- 2.03) | config, model, log |
64 | hourglass | 10 | ✔️ | 12.25 (+/- 2.42) | config, model, log |
64 | hourglass | 20 | ✔️ | 11.46 (+/- 0.83) | config, model, log |
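For reference, a mean/std summary in the format used in the table can be produced as sketched below. This assumes the sample standard deviation (whether the reported numbers use the sample or the population estimate is not stated here), and the input values in the example are made up for illustration; they are not results from this project.

```python
import statistics


def summarise_runs(errors):
    """Format per-run errors as 'mean (+/- std)'.

    Assumption: the sample standard deviation is used.
    """
    mean = statistics.mean(errors)
    std = statistics.stdev(errors)
    return "{:.2f} (+/- {:.2f})".format(mean, std)


if __name__ == "__main__":
    # Illustrative values only, not measurements.
    print(summarise_runs([10.0, 12.0, 14.0]))
```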
For each dataset used in the paper, we provide a preprocessed copy to allow the results described above to be reproduced directly. These can be downloaded and unpacked with a utility script, which will store them in the locations expected by the training code. Each dataset has a brief README, which also provides the citations for use with each dataset, together with a link from which it can be downloaded directly.
Dataset | Details and links | Archive size | sha1sum |
---|---|---|---|
CelebA (+ MAFL) | README | 9.0 GiB | f6872ab0f2df8e5843abe99dc6d6100dd4fea29f |
300w | README | 3.0 GiB | 885b09159c61fa29998437747d589c65cfc4ccd3 |
AFLWM | README | 252 MiB | 1ff31c07cef4f2777b416d896a65f6c17d8ae2ee |
AFLWR | README | 1.1 GiB | 939fdce0e6262a14159832c71d4f84a9d516de5e |
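If you download an archive manually, its checksum can be compared against the table with a few lines of Python. This is a minimal sketch using the standard `hashlib` module; the function names are hypothetical, and the archive filename in the usage note is a placeholder because the archive names are not listed here.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected_sha1):
    """Return True if the file's SHA-1 matches the expected hex digest."""
    return sha1_of_file(path) == expected_sha1.strip().lower()
```

For example, `verify_archive("<path-to-300w-archive>", "885b09159c61fa29998437747d589c65cfc4ccd3")` should return `True` for an intact copy of the 300w archive.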
In the codebase AFLW<sub>R</sub> is simply referred to as `AFLW`, while AFLW<sub>M</sub> is referred to as `AFLW-MTFL`. For 300w, we compute the inter-ocular distance according to the definition given by the dataset organizers here. Some of the logs are generated from existing logfiles that were created with a slightly older version of the codebase (these differences only affect the log format, rather than the training code itself; the log generator can be found here).
Evaluating a pretrained model for a given dataset requires:
- The target dataset, which should be located in `<root>/data/<dataset-name>` (this will be done automatically by the data fetching script, or can be done manually).
- A `config.json` file.
- A `checkpoint.pth` file.
Evaluation is then performed with the following command:
python3 test_matching.py --config <path-to-config.json> --resume <path-to-trained_model.pth> --device <gpu-id>
where `<gpu-id>` is the index of the GPU to evaluate on. This option can be omitted to run the evaluation on the CPU.
For example, to reproduce the `smallnet-32d-dve` results described above, run the following sequence of commands:
# fetch the mafl dataset (bundled with celeba)
python misc/sync_datasets.py --dataset celeba
# find the name of a pretrained model using the links in the tables above
export MODEL=data/models/celeba-smallnet-32d-dve/2019-08-02_06-19-59/checkpoint-epoch100.pth
# create a local directory and download the model into it
mkdir -p $(dirname "${MODEL}")
wget --output-document="${MODEL}" "http://www.robots.ox.ac.uk/~vgg/research/DVE/${MODEL}"
# Evaluate the model
python3 test_matching.py --config configs/celeba/smallnet-32d-dve.json --resume ${MODEL} --device 0
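When evaluating several checkpoints, it can be convenient to assemble the command from Python. The sketch below only uses the `--config`, `--resume` and `--device` flags shown above; the helper name `build_eval_command` is hypothetical and is not part of this codebase.

```python
import shlex


def build_eval_command(config_path, checkpoint_path, device=None):
    """Build the argument list for a test_matching.py run.

    Omitting `device` leaves out the --device flag, which corresponds to
    running the evaluation on the CPU.
    """
    cmd = [
        "python3", "test_matching.py",
        "--config", config_path,
        "--resume", checkpoint_path,
    ]
    if device is not None:
        cmd += ["--device", str(device)]
    return cmd


if __name__ == "__main__":
    cmd = build_eval_command(
        "configs/celeba/smallnet-32d-dve.json",
        "data/models/celeba-smallnet-32d-dve/2019-08-02_06-19-59/"
        "checkpoint-epoch100.pth",
        device=0,
    )
    # Print the command; pass `cmd` to subprocess.run(cmd, check=True) to run it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the arguments as a list avoids shell quoting problems if a path contains spaces.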
Learning a landmark regressor for a given pretrained embedding requires:
- The target dataset, which should be located in `<root>/data/<dataset-name>` (this will be done automatically by the data fetching script, or can be done manually).
- A `config.json` file.
- A `checkpoint.pth` file.
See the regressor code for details of how the regressor is implemented (it consists of a conv, then a spatial softmax, then a group conv).
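The spatial softmax step can be sketched in pure Python as follows. This is an illustration of the general technique (a soft-argmax over a single heatmap) and not the regressor code itself; the function name is hypothetical.

```python
import math


def spatial_softmax_keypoint(heatmap):
    """Return the expected (row, col) under a softmax over all pixels.

    `heatmap` is an H x W grid (list of lists) of unnormalised scores.
    """
    peak = max(value for row in heatmap for value in row)
    exps = [[math.exp(value - peak) for value in row] for row in heatmap]
    total = sum(sum(row) for row in exps)
    expected_row = sum(r * sum(row) for r, row in enumerate(exps)) / total
    expected_col = sum(
        c * value for row in exps for c, value in enumerate(row)
    ) / total
    return expected_row, expected_col


if __name__ == "__main__":
    # A strongly peaked heatmap yields a keypoint close to the peak location.
    heatmap = [[0.0, 0.0, 0.0], [0.0, 0.0, 50.0], [0.0, 0.0, 0.0]]
    print(spatial_softmax_keypoint(heatmap))
```

Because the output is a weighted average of pixel coordinates, it is differentiable with respect to the heatmap, which is what allows the keypoint locations to be learned by gradient descent.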
Landmark learning is then performed with the following command:
python3 train.py --config <path-to-config.json> --resume <path-to-trained_model.pth> --device <gpu-id>
where `<gpu-id>` is the index of the GPU to train on. This option can be omitted to run the training on the CPU.
For example, to reproduce the `smallnet-32d-dve` landmark regression results described above, run the following sequence of commands:
# fetch the mafl dataset (bundled with celeba)
python misc/sync_datasets.py --dataset celeba
# find the name of a pretrained model using the links in the tables above
export MODEL=data/models/celeba-smallnet-32d-dve/2019-08-08_17-56-24/checkpoint-epoch100.pth
# create a local directory and download the model into it
mkdir -p $(dirname "${MODEL}")
wget --output-document="${MODEL}" "http://www.robots.ox.ac.uk/~vgg/research/DVE/${MODEL}"
# Evaluate the features by training a keypoint regressor
python3 train.py --config configs/aflw-keypoints/celeba-smallnet-32d-dve.json --device 0
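The `mkdir` and `wget` steps above can also be performed from Python. This is a minimal sketch that assumes, as in the commands above, that the download URL is the base URL followed by the local model path; the helper names are hypothetical.

```python
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "http://www.robots.ox.ac.uk/~vgg/research/DVE/"


def model_url(model_path):
    """Map a local model path (with forward slashes) to its download URL."""
    return BASE_URL + str(model_path).lstrip("/")


def fetch_model(model_path):
    """Download a pretrained model into `model_path` if it is not present.

    Unlike the wget command, an existing file is left untouched.
    """
    dest = Path(model_path)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urlretrieve(model_url(model_path), str(dest))
    return dest
```

For example, `fetch_model("data/models/celeba-smallnet-32d-dve/2019-08-08_17-56-24/checkpoint-epoch100.pth")` mirrors the download step of the example above.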
Learning a new embedding requires:
- The dataset used for training, which should be located in `<root>/data/<dataset-name>` (this will be done automatically by the data fetching script, or can be done manually).
- A `config.json` file. You can define your own, or use one of the provided configs in the configs directory.
Training is then performed with the following command:
python3 train.py --config <path-to-config.json> --device <gpu-id>
where `<gpu-id>` is the index of the GPU to train on. This option can be omitted to run the training on the CPU.
For example, to train a `16d-dve` embedding on `celeba`, run the following sequence of commands:
# fetch the celeba dataset
python misc/sync_datasets.py --dataset celeba
# Train the model
python3 train.py --config configs/celeba/smallnet-16d-dve.json --device 0
If you have enough disk space, the recommended approach to installing the dependencies for this project is to create a conda environment via the `requirements/conda-freeze.txt`:
conda env create -f requirements/conda-freeze.yml
Otherwise, if you'd prefer to take a leaner approach, you can either:
- `pip/conda install` each missing package each time you hit an `ImportError`
- manually inspect the slightly more readable `requirements/pip-requirements.txt`
If you find this code useful, please consider citing:
@inproceedings{Thewlis2019a,
author = {Thewlis, J. and Albanie, S. and Bilen, H. and Vedaldi, A.},
booktitle = {International Conference on Computer Vision},
title = {Unsupervised learning of landmarks by exchanging descriptor vectors},
date = {2019},
}
Some other codebases you might like to check out if you are interested in self-supervised learning of object structure.
We would like to thank Almut Sophia Koepke for helpful discussions. The project structure uses the pytorch-template by @victoresque.
[1] James Thewlis, Samuel Albanie, Hakan Bilen, and Andrea Vedaldi. "Unsupervised learning of landmarks by exchanging descriptor vectors." ICCV 2019.
[2] James Thewlis, Hakan Bilen, and Andrea Vedaldi. "Unsupervised learning of object landmarks by factorized spatial embeddings." ICCV 2017.
[3] James Thewlis, Hakan Bilen, and Andrea Vedaldi. "Unsupervised learning of object frames by dense equivariant image labelling." NeurIPS 2017.
[4] N. Sundaram, T. Brox, and K. Keutzer. "Dense point trajectories by GPU-accelerated large displacement optical flow." ECCV 2010.
[5] C. Zach, M. Klopschitz, and M. Pollefeys. "Disambiguating visual relations using loop constraints." CVPR 2010.
[6] T. Zhou, Y. Jae Lee, S. X. Yu, and A. A. Efros. "Flowweb: Joint image set alignment by weaving consistent, pixel-wise correspondences." CVPR 2015.
[7] Yuting Zhang, Yijie Guo, Yixin Jin, Yijun Luo, Zhiyuan He, and Honglak Lee. "Unsupervised discovery of object landmarks as structural representations." CVPR 2018.
[8] T. Jakab, A. Gupta, H. Bilen, and A. Vedaldi. "Unsupervised learning of object landmarks through conditional image generation." NeurIPS 2018.
[9] Olivia Wiles, A. Sophia Koepke, and Andrew Zisserman. "Self-supervised learning of a facial attribute embedding from video." BMVC 2018.