This pull request adds an implementation that imports the CelebA dataset (http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) into Plato.
Description
This pull request adds the file `plato/datasources/celeba.py`, which downloads and pre-processes the CelebA images and labels for the training and test sets. The pre-processing steps for the images were borrowed from the DCGAN PyTorch tutorial (https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html).
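For reference, the transform pipeline in that tutorial looks roughly like the sketch below; the image size of 64 and the normalization constants are the tutorial's defaults, not necessarily the exact values used in `celeba.py`.

```python
from torchvision import transforms

# DCGAN-tutorial-style pre-processing (sketch); image_size and the
# normalization constants are the tutorial's defaults, not necessarily
# the values used in celeba.py.
image_size = 64

transform = transforms.Compose([
    transforms.Resize(image_size),         # scale the shorter side to image_size
    transforms.CenterCrop(image_size),     # crop to image_size x image_size
    transforms.ToTensor(),                 # PIL image -> tensor in [0, 1]
    transforms.Normalize((0.5, 0.5, 0.5),  # shift each RGB channel to [-1, 1]
                         (0.5, 0.5, 0.5)),
])
```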
The CelebA dataset provides four types of labels for each image: a face bounding box, face landmarks (i.e., the positions of the eyes, mouth, and nose), facial attributes (e.g., wearing glasses, blonde hair), and a face identity. However, `celeba.py` currently only includes the attributes and identities in the training and test sets. The main reason is that, after the resizing and cropping in the pre-processing step, the original coordinates of the landmarks and bounding boxes are likely no longer correct for the transformed images. Additionally, the bounding box and landmark labels are mainly used for face localization tasks, which I think are not currently supported in Plato, so it does not seem worthwhile to include these labels. If we ever need the bounding box and landmark data in the future, we can update `celeba.py` to transform these labels to match the resizing and cropping of the images.
Although the CelebA dataset is available directly from torchvision, it is not used directly in `celeba.py`. Instead, `torchvision.datasets.CelebA` is wrapped by a local `CelebA` class. The purpose of the wrapper is to add `targets` and `classes` attributes (which do not exist in the original class) to the training and test sets of CelebA. These two attributes are used by the non-IID samplers as the basis for distributing training data with bias across the clients. Here, we simply use the identity of the face in each image as the target, so that each client will likely hold much more data from some individuals than from others, which, I believe, reflects real-life scenarios. With this addition, the CelebA data should work with the existing IID and non-IID samplers.
One major problem with the CelebA data is that users will often fail to download it with `torchvision.datasets.CelebA`. The creators of CelebA host the data on Google Drive, which enforces a daily download quota; once the quota is reached, the data cannot be downloaded again for the rest of the day. This issue has been acknowledged by the PyTorch developers, and a detailed discussion can be found in pytorch/vision#1920. According to the developers, there is no solution other than waiting for the quota to reset, since they cannot change where the data is hosted. The current workaround is for users to manually download all the files from the CelebA authors to their local machines under the directory `<data path>/celeba` and unzip the compressed file in the same directory before running anything, where `<data path>` is the data path specified by the user in the config.
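For example, once the files from the authors' Google Drive folder have been placed and the image archive unzipped under `<data path>/celeba`, torchvision can be pointed at that data path without attempting another download. This is a generic torchvision usage example with a hypothetical `./data` path, not code from this pull request:

```python
from torchvision import datasets

# './data' is a placeholder; substitute the data path from your config.
# The manually downloaded files must live under ./data/celeba, with the
# image archive already unzipped in that same directory.
train_set = datasets.CelebA(root='./data',
                            split='train',
                            target_type='identity',
                            download=False)  # do not hit Google Drive
```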
How has this been tested?

Individual methods of the `celeba.DataSource` class have been tested in local test files. The entire dataset has not yet been tested for training; it is expected to be tested when the DCGAN model is introduced.