
Add CelebA Dataset to Plato #164

Merged
merged 8 commits into from
May 12, 2022

Conversation

cuiboyuan
Collaborator

This pull request adds an implementation that imports the CelebA dataset (http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) into Plato.

Description

This pull request adds the file plato/datasources/celeba.py, which downloads the CelebA images and labels and pre-processes them into proper training and test sets. The image pre-processing steps were borrowed from the DCGAN PyTorch tutorial (https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html).

The CelebA dataset contains four types of labels for each image: a face bounding box, face landmarks (i.e., the positions of the eyes, mouth, and nose), facial attributes (e.g., wearing glasses, blonde hair), and a face identity. However, celeba.py currently only includes the attributes and identities in the training and test sets. The main reason is that after the images are resized and cropped in the pre-processing step, the original landmark and bounding box coordinates are most likely no longer correct for the transformed images. Additionally, the bounding box and landmark labels are mainly used for face localization tasks, which, as far as I know, are not currently supported in Plato, so I do not believe it is worthwhile to include these labels. If we ever need the bounding box and landmark data in the future, we can update celeba.py to transform these labels to match the resizing and cropping of the images.
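If we do add this later, the coordinate adjustment itself is straightforward. The sketch below is not part of this pull request; `transform_bbox` is a hypothetical helper, and it assumes the pre-processing is a crop (by a known offset) followed by a uniform resize (by a known scale factor):

```python
def transform_bbox(bbox, crop_left, crop_top, scale):
    """Map a CelebA bounding box (x, y, width, height), given in original
    image coordinates, into the frame of an image that was first cropped
    (dropping `crop_left` / `crop_top` pixels) and then resized by `scale`.

    Landmark points (eyes, nose, mouth corners) would be transformed the
    same way, minus the width/height terms.
    """
    x, y, w, h = bbox
    # Shift by the crop offset, then scale everything uniformly.
    return ((x - crop_left) * scale,
            (y - crop_top) * scale,
            w * scale,
            h * scale)
```

For example, a box at (10, 20) with size 30x40, after cropping 5 pixels from the left and 10 from the top and resizing by 2x, lands at (10.0, 20.0) with size 60.0x80.0.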

Although the CelebA dataset is available directly from torchvision, celeba.py does not use it as-is. Instead, torchvision.datasets.CelebA is wrapped by a local CelebA class. The purpose is to add targets and classes attributes (which do not exist in the original class) to the training and test sets. These two attributes are used by the non-IID samplers as the basis for distributing training data across clients with bias. Here, we simply use the identity of the face in each image as the target, so that each client will likely hold much more data from some individuals than from others, which, I believe, reflects real-life scenarios. With this addition, the CelebA data should work fine with the existing IID and non-IID samplers.
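The shape of such a wrapper can be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: the constructor arguments are placeholders, and `identities` stands in for the per-image face-identity labels that torchvision.datasets.CelebA exposes:

```python
class CelebAWrapper:
    """Sketch of a wrapper that adds the `targets` and `classes`
    attributes expected by Plato's samplers (hypothetical API)."""

    def __init__(self, dataset, identities):
        # `dataset` plays the role of a torchvision.datasets.CelebA
        # instance; `identities` is one face-identity label per image.
        self._dataset = dataset
        # `targets`: per-sample labels the non-IID samplers partition on.
        self.targets = list(identities)
        # `classes`: the distinct labels (here, distinct individuals).
        self.classes = sorted(set(identities))

    def __getitem__(self, index):
        # Delegate sample access to the wrapped dataset.
        return self._dataset[index]

    def __len__(self):
        return len(self._dataset)
```

With identities as targets, a label-biased non-IID sampler naturally gives each client a skewed mix of individuals, which is the behaviour described above.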

One major problem with the CelebA data is that downloading it via torchvision.datasets.CelebA often fails. The creators of CelebA host the data on Google Drive, which enforces a daily download quota; once the quota is reached, the data cannot be downloaded again until the next day. This issue is recognized by the PyTorch developers, and there is a detailed description here: pytorch/vision#1920

According to the developers, there is really no solution other than waiting for the quota to reset, since they cannot change where the data is hosted. The current workaround is for users to manually download all the files released by the CelebA authors to their local machines under the directory <data path>/celeba, and to unzip the compressed file in the same directory before running anything, where <data path> is the data path specified by the user in the config.

How has this been tested?

The individual methods of the celeba.DataSource class have been tested in local test files. The entire dataset has not yet been tested for training; it is expected to be tested when the DCGAN model is introduced.

Types of changes

  • Bug fix (non-breaking change which fixes an issue) Fixes #
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

@baochunli
Collaborator

Are we able to automatically download the dataset from a different URL? Please see the CINIC-10 dataset as an example of this feature. We can host the dataset on our own web server.

@baochunli baochunli requested a review from NingxinSu May 8, 2022 22:47
@cuiboyuan
Collaborator Author

cuiboyuan commented May 11, 2022

So I tried to run DataSource.download(<data url>, <data path>) on the GPU server, where <data url> is <GPU server URL>/~<my GPU username>/celeba.tar.gz, but I got a certificate error:

requests.exceptions.SSLError: HTTPSConnectionPool(host=<GPU server>, port=443): Max retries exceeded with url: /~<my username>/celeba.tar.gz (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)')))

Any idea what causes this error and how to resolve it?

By the way, I got the same error when I tried to run the config file for CINIC-10 on the GPU server.

@baochunli
Collaborator

Please use http:// instead. I have reconfigured the server to make it work.

@baochunli baochunli merged commit 9801646 into main May 12, 2022
@baochunli baochunli deleted the celebA branch May 12, 2022 17:37
@baochunli baochunli restored the celebA branch May 12, 2022 17:40
@baochunli baochunli deleted the celebA branch May 12, 2022 17:40