
Add CelebA Dataset to Plato #164

Merged
merged 8 commits into from
May 12, 2022

Conversation

cuiboyuan
Collaborator

This pull request adds an implementation that imports the CelebA dataset (http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) into Plato.

Description

This pull request adds the file plato/datasources/celeba.py, which downloads the CelebA images and labels and pre-processes them into proper training and test sets. The image pre-processing steps were borrowed from the DCGAN PyTorch tutorial (https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html).

The CelebA dataset contains four types of labels for each image: a face bounding box, face landmarks (i.e., the positions of the eyes, mouth, and nose), facial attributes (e.g., wearing glasses, blonde hair), and a face identity. However, celeba.py currently only includes the attributes and identities in the training and test sets. The main reason is that after the images are resized and cropped in the pre-processing step, the original landmark and bounding box coordinates are most likely no longer correct for the transformed images. Additionally, the bounding box and landmark labels are mainly used for face localization tasks, which, as far as I know, are not currently supported in Plato, so I do not believe it is worthwhile to include these labels. If we ever need the bounding box and landmark data in the future, we can update celeba.py to transform these labels to match the resizing and cropping of the images.
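If we do add this later, the coordinate adjustment itself is straightforward. The sketch below is not part of this pull request; `transform_bbox` is a hypothetical helper, and it assumes the pre-processing is a crop (by a known offset) followed by a uniform resize (by a known scale factor):

```python
def transform_bbox(bbox, crop_left, crop_top, scale):
    """Map a CelebA bounding box (x, y, width, height), given in original
    image coordinates, into the frame of an image that was first cropped
    (dropping `crop_left` / `crop_top` pixels) and then resized by `scale`.

    Landmark points (eyes, nose, mouth corners) would be transformed the
    same way, minus the width/height terms.
    """
    x, y, w, h = bbox
    # Shift by the crop offset, then scale everything uniformly.
    return ((x - crop_left) * scale,
            (y - crop_top) * scale,
            w * scale,
            h * scale)
```

For example, a box at (10, 20) with size 30x40, after cropping 5 pixels from the left and 10 from the top and resizing by 2x, lands at (10.0, 20.0) with size 60.0x80.0.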

Although the CelebA dataset is available directly from torchvision, celeba.py does not use it as-is. Instead, torchvision.datasets.CelebA is wrapped by a local CelebA class. The purpose is to add targets and classes attributes (which do not exist in the original class) to the training and test sets. These two attributes are used by the non-IID samplers as the basis for distributing training data across clients with bias. Here, we simply use the identity of the face in each image as the target, so that each client will likely hold much more data from some individuals than from others, which, I believe, reflects real-life scenarios. With this addition, the CelebA data should work fine with the existing IID and non-IID samplers.
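The shape of such a wrapper can be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: the constructor arguments are placeholders, and `identities` stands in for the per-image face-identity labels that torchvision.datasets.CelebA exposes:

```python
class CelebAWrapper:
    """Sketch of a wrapper that adds the `targets` and `classes`
    attributes expected by Plato's samplers (hypothetical API)."""

    def __init__(self, dataset, identities):
        # `dataset` plays the role of a torchvision.datasets.CelebA
        # instance; `identities` is one face-identity label per image.
        self._dataset = dataset
        # `targets`: per-sample labels the non-IID samplers partition on.
        self.targets = list(identities)
        # `classes`: the distinct labels (here, distinct individuals).
        self.classes = sorted(set(identities))

    def __getitem__(self, index):
        # Delegate sample access to the wrapped dataset.
        return self._dataset[index]

    def __len__(self):
        return len(self._dataset)
```

With identities as targets, a label-biased non-IID sampler naturally gives each client a skewed mix of individuals, which is the behaviour described above.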

One major problem with the CelebA data is that downloading it via torchvision.datasets.CelebA often fails. The creators of CelebA host the data on Google Drive, which enforces a daily download quota; once the quota is reached, the data cannot be downloaded again until the next day. This issue is recognized by the PyTorch developers, and there is a detailed description here: pytorch/vision#1920

According to the developers, there is really no solution other than waiting for the quota to reset, since they cannot change where the data is hosted. The current workaround is for users to manually download all the files released by the CelebA authors to their local machines under the directory <data path>/celeba, and to unzip the compressed file in the same directory before running anything, where <data path> is the data path specified by the user in the config.

How has this been tested?

The individual methods of the celeba.DataSource class have been tested in local test files. The entire dataset has not yet been tested for training; it is expected to be tested when the DCGAN model is introduced.

Types of changes

  • Bug fix (non-breaking change which fixes an issue) Fixes #
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

@baochunli
Collaborator

Are we able to automatically download the dataset from a different URL? Please see the CINIC-10 dataset as an example of this feature. We can host the dataset on our own web server.

@baochunli baochunli requested a review from NingxinSu May 8, 2022 22:47
@cuiboyuan
Collaborator Author

cuiboyuan commented May 11, 2022

So I tried to run DataSource.download(<data url>, <data path>) on the GPU server, where <data url> is <GPU server URL>/~<my GPU username>/celeba.tar.gz, but I got a certificate error:

requests.exceptions.SSLError: HTTPSConnectionPool(host=<GPU server>, port=443): Max retries exceeded with url: /~<my username>/celeba.tar.gz (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)')))

Any idea what causes this error and how to resolve it?

By the way, I got the same error when I tried to run the config file for CINIC-10 on the GPU server.

@baochunli
Collaborator

Please use http:// instead. I have reconfigured the server to make it work.

@baochunli baochunli merged commit 9801646 into main May 12, 2022
@baochunli baochunli deleted the celebA branch May 12, 2022 17:37
@baochunli baochunli restored the celebA branch May 12, 2022 17:40
@baochunli baochunli deleted the celebA branch May 12, 2022 17:40