Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added LFW Dataset #4255

Merged
merged 31 commits into from
Sep 14, 2021
Merged

Added LFW Dataset #4255

merged 31 commits into from
Sep 14, 2021

Conversation

ABD-01
Copy link
Contributor

@ABD-01 ABD-01 commented Aug 5, 2021

Hi!
This pr is with reference to #3562

I added the dataset class for LFW Labeled Faces in Wild

Please do comment for the required changes, and other things to be done.
Thank You

@facebook-github-bot
Copy link

Hi @ABD-01!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@facebook-github-bot
Copy link

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

Copy link
Collaborator

@pmeier pmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ABD-01 and thanks for the PR. Sorry about the delay as I was on vacation.

I have a few minor comments inline and only reviewed LFW_People, since most of the comments also apply to LFW_Pairs. Apart from that I have three more comments:

  1. The two datasets have a very similar structure. Since they return different things, I think having two separate classes is fine to avoid polymorphism. Still, I wonder if it makes sense to have a common superclass, e.g. _LFW that groups the common behavior such as downloads.
  2. We have automated tests for datasets located in test/test_datasets.py. Can you please add tests for these datasets as well? The main challenge is to provide "fake data", i.e. the (extracted) data that would be present after .download() is completed. It should resemble the structure of the original dataset, but only provide a few dummy images. Let me know if you need help with that.
  3. Please also add the datasets to the list in torchvision/datasets/__init__.py as well to our documentation in docs/source/datasets.rst. Otherwise they are not discoverable.

torchvision/datasets/lfw.py Outdated Show resolved Hide resolved
torchvision/datasets/lfw.py Outdated Show resolved Hide resolved
torchvision/datasets/lfw.py Outdated Show resolved Hide resolved
torchvision/datasets/lfw.py Outdated Show resolved Hide resolved
torchvision/datasets/lfw.py Outdated Show resolved Hide resolved
torchvision/datasets/lfw.py Outdated Show resolved Hide resolved
torchvision/datasets/lfw.py Outdated Show resolved Hide resolved
torchvision/datasets/lfw.py Outdated Show resolved Hide resolved
torchvision/datasets/lfw.py Outdated Show resolved Hide resolved
ABD-01 and others added 3 commits August 23, 2021 10:18
* Created a common superclass for people and pairs type datatsets
* corrected the .download() method
@ABD-01
Copy link
Contributor Author

ABD-01 commented Aug 23, 2021

Hi, I have made new commits as per the comments.
The tests are remaining so I'll add them, could you please guide me on how to proceed with that?

Copy link
Collaborator

@pmeier pmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ABD-01 and thanks for the continuing effort. I have some minor comments inline. Apart from them, two more things:

  1. Please also add the new datasets to our documentation as shortly outlined in Added LFW Dataset #4255 (review)
  2. I have not yet reviewed the _get_people and _get_pairs methods, as they really depend on the data format, which I don't know yet. However, this will be solved by the next point:

The tests are remaining so I'll add them, could you please guide me on how to proceed with that?

Sure! You need to create two test cases, LFWPeopleTestCase and LFWPairsTestCase, in tess_datasets.py that inherit from

class DatasetTestCase(unittest.TestCase):

Please read the docstring to get a better understanding what needs to be done.

The absolute minimum that you need to provide, is an implementation of the inject_fake_data method. It gets passed a tmp_dir that needs to contain all files in the correct structure that would be present in root if I would instantiate the dataset with LFWPeople(root, download=True). If the content changes depending on other keyword arguments, you can query them from the passed config dict, i.e. config["train"]. If you overwrite the class attribute ADDITIONAL_CONFIGS for example with

ADDITIONAL_CONFIGS = datasets_utils.combinations_grid(
    train=(True, False), image_set=("original", "funneled", "deepfunneled"),
)

you will get automate testing for all combinations.

Of course the files provided by inject_fake_data musn't be the original files, but rather mock ups of them, i.e. "fake data". For example, you will want to create a peopleDevTrain.txt file, that only contains a few lines. For the accompanying images, you can use convenience utilities such as

def create_image_file(

or

def create_image_folder(

For the LFW dataset, I propose you add a _LFWTestCase that handles the injection of fake data so that the actual test cases only need to handle their special stuff.


If all of the above is "too much", we can also do it step by step. Implement something and if you hit a roadblock, push the commit and ping me so I can take a look at it and guide you more precisely.

torchvision/datasets/lfw.py Outdated Show resolved Hide resolved
torchvision/datasets/lfw.py Outdated Show resolved Hide resolved
torchvision/datasets/lfw.py Outdated Show resolved Hide resolved
torchvision/datasets/lfw.py Outdated Show resolved Hide resolved
@ABD-01
Copy link
Contributor Author

ABD-01 commented Aug 28, 2021

Hey, sorry for the delay for this update (had some college work).
I have added the tests for LFWPeopleTestCase and used it as super class for LFWPairsTestCase

@ABD-01
Copy link
Contributor Author

ABD-01 commented Aug 29, 2021

I have updated the tests as all cases were failing before.
Now only the feature types test is failing and I can't figure out why is this happening.

======================================================================
FAIL: test_feature_types (__main__.LFWPairsTestCase) (idx=0, image_set='deepfunneled', train=True)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/vision/test/datasets_utils.py", line 553, in test_feature_types
    assert isinstance(feature, expected_feature_type)
AssertionError

Copy link
Collaborator

@pmeier pmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now only the feature types test is failing and I can't figure out why is this happening.

The test is not implemented in a way that lets you check the types inside of a container. By providing (PIL.Image.Image, PIL.Image.Image) as first element of the FEATURE_TYPES class attribute, the test will check if the first element of a returned sample is either a PIL.Image.Image or a PIL.Image.Image. Since it is tuple, the test fails. Weirdly, the introspection of pytest is "disabled" and this is why you don't get a proper error message (cc @NicolasHug ).

To fix this, you would need to define FEATURE_TYPES = (tuple, int). But then we should discuss if this is the way to go. What is the reason you want to return the image pair in a tuple rather than individually, i.e. return img1, img2, target? We do it like that in our PhotoTour dataset:

return data1, data2, m[2]

Apart from that, IMO the PR is 95% done. Thanks a lot @ABD-01! I have one more minor comment inline.

test/test_datasets.py Outdated Show resolved Hide resolved
torchvision/datasets/lfw.py Outdated Show resolved Hide resolved
@ABD-01
Copy link
Contributor Author

ABD-01 commented Aug 30, 2021

Thanks for the Review.

What is the reason you want to return the image pair in a tuple rather than individually, i.e. return img1, img2, target?

We can do it like this as well. I guess that'll be better way. I was returning a tuple as we needed a pair of images and a label whether they are same or not.

Apart form that I have a small doubt, there's a 10 fold validation (pairs and people) view of dataset, should I add it??
The problem is that the format is similar but not same, so how can we add that?
Can we have like: split: Train, Test, 10-fold. And for 10-fold use different function to get data?

@pmeier
Copy link
Collaborator

pmeier commented Aug 30, 2021

Apart form that I have a small doubt, there's a 10 fold validation (pairs and people) view of dataset, should I add it??

Since you want to contribute this dataset, I'm guessing you want to use it for research or something related, right? If you know the field, is the fold variant used in other publications? If not, I'm fine to table this for now and only add it if we get requests for it. If you want to add it, that would also be ok for me.

The problem is that the format is similar but not same, so how can we add that?
Can we have like: split: Train, Test, 10-fold. And for 10-fold use different function to get data?

What format are you talking about? Isn't the 10-fold variant just a different selection of images rather than the train / test split?

Copy link
Collaborator

@pmeier pmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two more nitpicks after I've looked at the tests again. Not blocking, but would improve the code quality.

test/test_datasets.py Outdated Show resolved Hide resolved
test/test_datasets.py Outdated Show resolved Hide resolved
@ABD-01
Copy link
Contributor Author

ABD-01 commented Aug 30, 2021

What format are you talking about? Isn't the 10-fold variant just a different selection of images rather than the train / test split?

Yes it's almost same, just that it is divided in 10 parts, whose each part can be viewed as LFWPairs or LFWPeople class

@pmeier
Copy link
Collaborator

pmeier commented Aug 30, 2021

Yes it's almost same, just that it is divided in 10 parts, whose each part can be viewed as LFWPairs or LFWPeople class

Ok, but that means that each fold is similar to a split, right? If yes, instead of train: bool we could have split: str, which takes either "train", "test", as well as fold_X where X is the number of the split you want to select. What do you think?

Still, let's first answer if we want to go that route in this PR.

@ABD-01
Copy link
Contributor Author

ABD-01 commented Aug 30, 2021

Ok, but that means that each fold is similar to a split, right? If yes, instead of train: bool we could have split: str, which takes either "train", "test", as well as fold_X where X is the number of the split you want to select. What do you think?

Still, let's first answer if we want to go that route in this PR.

I will try to add that functionality, if works out easily then fine, else we are good to go with the current code.

ABD-01 and others added 4 commits August 31, 2021 21:39
* Added functionality for 10-fold validation view
* Optimized the code so to replace repeated lines by method in super
  class
* Updated LFWPeople to get classes from all lfw-names.txt rather than
  just the classes fron trainset
@ABD-01
Copy link
Contributor Author

ABD-01 commented Sep 4, 2021

Hey!, I did some updates
I will add the tests in short time

Copy link
Collaborator

@pmeier pmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ABD-01, I have a hint about the failing CI as well as one nitpick inline.

Apart from that, there are two things left for you to do at the moment: address #4255 (comment) and #4255 (comment). If that is done and tests are passing, I'll do one final pass, because I haven't reviewed the people / pair extraction yet. Please ping me if the PR is otherwise ready.

test/test_datasets.py Outdated Show resolved Hide resolved
test/test_datasets.py Outdated Show resolved Hide resolved
test/test_datasets.py Outdated Show resolved Hide resolved
@ABD-01
Copy link
Contributor Author

ABD-01 commented Sep 9, 2021

@pmeier The PR is ready !!

Copy link
Collaborator

@pmeier pmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ABD-01! This really looks good. I have some very minor comments inline as well as one question. Otherwise this is good to go!

torchvision/datasets/lfw.py Show resolved Hide resolved
torchvision/datasets/lfw.py Outdated Show resolved Hide resolved
torchvision/datasets/lfw.py Show resolved Hide resolved
torchvision/datasets/lfw.py Outdated Show resolved Hide resolved
torchvision/datasets/lfw.py Outdated Show resolved Hide resolved
test/test_datasets.py Outdated Show resolved Hide resolved
torchvision/datasets/lfw.py Show resolved Hide resolved
torchvision/datasets/lfw.py Show resolved Hide resolved
Copy link
Collaborator

@pmeier pmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks a lot @ABD-01!

@ABD-01
Copy link
Contributor Author

ABD-01 commented Sep 10, 2021

Thank you Sir 🙌
Learned a lot here.

Will be happy to contribute further.

Copy link
Member

@fmassa fmassa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@fmassa fmassa merged commit d85aa6d into pytorch:main Sep 14, 2021
facebook-github-bot pushed a commit that referenced this pull request Sep 30, 2021
Summary:
* Added LFW Dataset

* Added dataset to list in __init__.py

* Updated lfw.py
* Created a common superclass for people and pairs type datatsets
* corrected the .download() method

* Added docstrings and updated datasets.rst

* Wrote tests for LFWPeople and LFWPairs

* Resolved mypy error: Need type annotation for "data"

* Updated inject_fake_data method for LFWPeople

* Updated tests for LFW

* Updated LFW tests and minor changes in lfw.py

* Updated LFW
* Added functionality for 10-fold validation view
* Optimized the code so to replace repeated lines by method in super
  class
* Updated LFWPeople to get classes from all lfw-names.txt rather than
  just the classes fron trainset

* Updated lfw.py and tests
* Updated inject_fake_data method to create 10fold fake data
* Minor changes in docstring and extra_repr

* resolved py lint errors

* Added checksums for annotation files

* Minor changes in test

* Updated docstrings, defaults and minor changes in test

* Removed 'os.path.exists' check

Reviewed By: datumbox

Differential Revision: D31268043

fbshipit-source-id: 4fba9d652184a176220fe1460fc22144caa81cc2

Co-authored-by: ABD-01 <myac931@gmai.com>
Co-authored-by: Philip Meier <github.pmeier@posteo.de>
Co-authored-by: Francisco Massa <fvsmassa@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants