Added LFW Dataset #4255

ABD-01 · 2021-08-05T11:02:56Z

Hi!
This pr is with reference to #3562

I added the dataset class for LFW Labeled Faces in Wild

Please do comment for the required changes, and other things to be done.
Thank You

facebook-github-bot · 2021-08-05T11:03:00Z

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

facebook-github-bot · 2021-08-05T11:19:21Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

pmeier

Hey @ABD-01 and thanks for the PR. Sorry about the delay as I was on vacation.

I have a few minor comments inline and only reviewed LFW_People, since most of the comments also apply to LFW_Pairs. Apart from that I have three more comments:

The two datasets have a very similar structure. Since they return different things, I think having two separate classes is fine to avoid polymorphism. Still, I wonder if it makes sense to have a common superclass, e.g. _LFW that groups the common behavior such as downloads.
We have automated tests for datasets located in test/test_datasets.py. Can you please add tests for these datasets as well? The main challenge is to provide "fake data", i.e. the (extracted) data that would be present after .download() is completed. It should resemble the structure of the original dataset, but only provide a few dummy images. Let me know if you need help with that.
Please also add the datasets to the list in torchvision/datasets/__init__.py as well to our documentation in docs/source/datasets.rst. Otherwise they are not discoverable.

torchvision/datasets/lfw.py

* Created a common superclass for people and pairs type datatsets * corrected the .download() method

ABD-01 · 2021-08-23T05:35:12Z

Hi, I have made new commits as per the comments.
The tests are remaining so I'll add them, could you please guide me on how to proceed with that?

pmeier

Hey @ABD-01 and thanks for the continuing effort. I have some minor comments inline. Apart from them, two more things:

Please also add the new datasets to our documentation as shortly outlined in Added LFW Dataset #4255 (review)
I have not yet reviewed the _get_people and _get_pairs methods, as they really depend on the data format, which I don't know yet. However, this will be solved by the next point:

The tests are remaining so I'll add them, could you please guide me on how to proceed with that?

Sure! You need to create two test cases, LFWPeopleTestCase and LFWPairsTestCase, in tess_datasets.py that inherit from

vision/test/datasets_utils.py

Line 180 in 7947fc8

class DatasetTestCase(unittest.TestCase):

Please read the docstring to get a better understanding what needs to be done.

The absolute minimum that you need to provide, is an implementation of the inject_fake_data method. It gets passed a tmp_dir that needs to contain all files in the correct structure that would be present in root if I would instantiate the dataset with LFWPeople(root, download=True). If the content changes depending on other keyword arguments, you can query them from the passed config dict, i.e. config["train"]. If you overwrite the class attribute ADDITIONAL_CONFIGS for example with

ADDITIONAL_CONFIGS = datasets_utils.combinations_grid(
    train=(True, False), image_set=("original", "funneled", "deepfunneled"),
)

you will get automate testing for all combinations.

Of course the files provided by inject_fake_data musn't be the original files, but rather mock ups of them, i.e. "fake data". For example, you will want to create a peopleDevTrain.txt file, that only contains a few lines. For the accompanying images, you can use convenience utilities such as

vision/test/datasets_utils.py

Line 666 in 7947fc8

def create_image_file(

or

vision/test/datasets_utils.py

Line 702 in 7947fc8

def create_image_folder(

For the LFW dataset, I propose you add a _LFWTestCase that handles the injection of fake data so that the actual test cases only need to handle their special stuff.

If all of the above is "too much", we can also do it step by step. Implement something and if you hit a roadblock, push the commit and ping me so I can take a look at it and guide you more precisely.

torchvision/datasets/lfw.py

ABD-01 · 2021-08-28T16:41:28Z

Hey, sorry for the delay for this update (had some college work).
I have added the tests for LFWPeopleTestCase and used it as super class for LFWPairsTestCase

ABD-01 · 2021-08-29T08:05:30Z

I have updated the tests as all cases were failing before.
Now only the feature types test is failing and I can't figure out why is this happening.

======================================================================
FAIL: test_feature_types (__main__.LFWPairsTestCase) (idx=0, image_set='deepfunneled', train=True)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/vision/test/datasets_utils.py", line 553, in test_feature_types
    assert isinstance(feature, expected_feature_type)
AssertionError

pmeier

Now only the feature types test is failing and I can't figure out why is this happening.

The test is not implemented in a way that lets you check the types inside of a container. By providing (PIL.Image.Image, PIL.Image.Image) as first element of the FEATURE_TYPES class attribute, the test will check if the first element of a returned sample is either a PIL.Image.Image or a PIL.Image.Image. Since it is tuple, the test fails. Weirdly, the introspection of pytest is "disabled" and this is why you don't get a proper error message (cc @NicolasHug ).

To fix this, you would need to define FEATURE_TYPES = (tuple, int). But then we should discuss if this is the way to go. What is the reason you want to return the image pair in a tuple rather than individually, i.e. return img1, img2, target? We do it like that in our PhotoTour dataset:

vision/torchvision/datasets/phototour.py

Line 118 in 5ef75d7

return data1, data2, m[2]

Apart from that, IMO the PR is 95% done. Thanks a lot @ABD-01! I have one more minor comment inline.

test/test_datasets.py

torchvision/datasets/lfw.py

ABD-01 · 2021-08-30T06:38:38Z

Thanks for the Review.

What is the reason you want to return the image pair in a tuple rather than individually, i.e. return img1, img2, target?

We can do it like this as well. I guess that'll be better way. I was returning a tuple as we needed a pair of images and a label whether they are same or not.

Apart form that I have a small doubt, there's a 10 fold validation (pairs and people) view of dataset, should I add it??
The problem is that the format is similar but not same, so how can we add that?
Can we have like: split: Train, Test, 10-fold. And for 10-fold use different function to get data?

pmeier · 2021-08-30T06:46:18Z

Apart form that I have a small doubt, there's a 10 fold validation (pairs and people) view of dataset, should I add it??

Since you want to contribute this dataset, I'm guessing you want to use it for research or something related, right? If you know the field, is the fold variant used in other publications? If not, I'm fine to table this for now and only add it if we get requests for it. If you want to add it, that would also be ok for me.

The problem is that the format is similar but not same, so how can we add that?
Can we have like: split: Train, Test, 10-fold. And for 10-fold use different function to get data?

What format are you talking about? Isn't the 10-fold variant just a different selection of images rather than the train / test split?

pmeier

Two more nitpicks after I've looked at the tests again. Not blocking, but would improve the code quality.

test/test_datasets.py

ABD-01 · 2021-08-30T06:53:35Z

What format are you talking about? Isn't the 10-fold variant just a different selection of images rather than the train / test split?

Yes it's almost same, just that it is divided in 10 parts, whose each part can be viewed as LFWPairs or LFWPeople class

pmeier · 2021-08-30T06:58:40Z

Yes it's almost same, just that it is divided in 10 parts, whose each part can be viewed as LFWPairs or LFWPeople class

Ok, but that means that each fold is similar to a split, right? If yes, instead of train: bool we could have split: str, which takes either "train", "test", as well as fold_X where X is the number of the split you want to select. What do you think?

Still, let's first answer if we want to go that route in this PR.

ABD-01 · 2021-08-30T07:08:29Z

Ok, but that means that each fold is similar to a split, right? If yes, instead of train: bool we could have split: str, which takes either "train", "test", as well as fold_X where X is the number of the split you want to select. What do you think?

Still, let's first answer if we want to go that route in this PR.

I will try to add that functionality, if works out easily then fine, else we are good to go with the current code.

* Added functionality for 10-fold validation view * Optimized the code so to replace repeated lines by method in super class * Updated LFWPeople to get classes from all lfw-names.txt rather than just the classes fron trainset

ABD-01 · 2021-09-04T10:19:38Z

Hey!, I did some updates
I will add the tests in short time

pmeier

Hey @ABD-01, I have a hint about the failing CI as well as one nitpick inline.

Apart from that, there are two things left for you to do at the moment: address #4255 (comment) and #4255 (comment). If that is done and tests are passing, I'll do one final pass, because I haven't reviewed the people / pair extraction yet. Please ping me if the PR is otherwise ready.

test/test_datasets.py

* Updated inject_fake_data method to create 10fold fake data * Minor changes in docstring and extra_repr

ABD-01 · 2021-09-09T18:03:02Z

@pmeier The PR is ready !!

pmeier

Hey @ABD-01! This really looks good. I have some very minor comments inline as well as one question. Otherwise this is good to go!

torchvision/datasets/lfw.py

test/test_datasets.py

torchvision/datasets/lfw.py

pmeier

This is great, thanks a lot @ABD-01!

ABD-01 · 2021-09-10T13:49:18Z

Thank you Sir 🙌
Learned a lot here.

Will be happy to contribute further.

fmassa

Thanks!

Summary: * Added LFW Dataset * Added dataset to list in __init__.py * Updated lfw.py * Created a common superclass for people and pairs type datatsets * corrected the .download() method * Added docstrings and updated datasets.rst * Wrote tests for LFWPeople and LFWPairs * Resolved mypy error: Need type annotation for "data" * Updated inject_fake_data method for LFWPeople * Updated tests for LFW * Updated LFW tests and minor changes in lfw.py * Updated LFW * Added functionality for 10-fold validation view * Optimized the code so to replace repeated lines by method in super class * Updated LFWPeople to get classes from all lfw-names.txt rather than just the classes fron trainset * Updated lfw.py and tests * Updated inject_fake_data method to create 10fold fake data * Minor changes in docstring and extra_repr * resolved py lint errors * Added checksums for annotation files * Minor changes in test * Updated docstrings, defaults and minor changes in test * Removed 'os.path.exists' check Reviewed By: datumbox Differential Revision: D31268043 fbshipit-source-id: 4fba9d652184a176220fe1460fc22144caa81cc2 Co-authored-by: ABD-01 <myac931@gmai.com> Co-authored-by: Philip Meier <github.pmeier@posteo.de> Co-authored-by: Francisco Massa <fvsmassa@gmail.com>

Added LFW Dataset

959880f

facebook-github-bot added the cla signed label Aug 5, 2021

ABD-01 added 2 commits August 6, 2021 18:29

Merge branch 'master' into add_lfw

5f4503e

Merge branch 'master' into add_lfw

4b012c5

datumbox added the module: datasets label Aug 16, 2021

datumbox requested a review from pmeier August 16, 2021 13:29

pmeier reviewed Aug 18, 2021

View reviewed changes

ABD-01 and others added 3 commits August 23, 2021 10:18

Merge branch 'pytorch:main' into add_lfw

bd414d4

Added dataset to list in __init__.py

ac4b4ad

Updated lfw.py

cfef8c8

* Created a common superclass for people and pairs type datatsets * corrected the .download() method

pmeier reviewed Aug 23, 2021

View reviewed changes

torchvision/datasets/lfw.py Outdated Show resolved Hide resolved

torchvision/datasets/lfw.py Outdated Show resolved Hide resolved

torchvision/datasets/lfw.py Outdated Show resolved Hide resolved

torchvision/datasets/lfw.py Outdated Show resolved Hide resolved

oke-aditya mentioned this pull request Aug 24, 2021

[RFC] New datasets to torchvision #3562

Open

17 tasks

ABD-01 and others added 4 commits August 26, 2021 16:12

Merge branch 'pytorch:main' into add_lfw

66f43c9

Added docstrings and updated datasets.rst

c7bf4ae

Wrote tests for LFWPeople and LFWPairs

749308a

Merge branch 'main' into add_lfw

3f4f214

ABD-01 added 3 commits August 29, 2021 00:30

Resolved mypy error: Need type annotation for "data"

6590da3

Updated inject_fake_data method for LFWPeople

acb68ae

Updated tests for LFW

10fffb0

pmeier reviewed Aug 30, 2021

View reviewed changes

test/test_datasets.py Outdated Show resolved Hide resolved

torchvision/datasets/lfw.py Outdated Show resolved Hide resolved

pmeier reviewed Aug 30, 2021

View reviewed changes

test/test_datasets.py Outdated Show resolved Hide resolved

test/test_datasets.py Outdated Show resolved Hide resolved

ABD-01 and others added 4 commits August 31, 2021 21:39

Merge branch 'pytorch:main' into add_lfw

0434472

Updated LFW tests and minor changes in lfw.py

87197f5

Merge branch 'main' into add_lfw

b34a173

Updated LFW

7549517

* Added functionality for 10-fold validation view * Optimized the code so to replace repeated lines by method in super class * Updated LFWPeople to get classes from all lfw-names.txt rather than just the classes fron trainset

pmeier reviewed Sep 6, 2021

View reviewed changes

test/test_datasets.py Outdated Show resolved Hide resolved

test/test_datasets.py Outdated Show resolved Hide resolved

test/test_datasets.py Outdated Show resolved Hide resolved

ABD-01 and others added 7 commits September 9, 2021 17:21

Updated lfw.py and tests

c52c891

* Updated inject_fake_data method to create 10fold fake data * Minor changes in docstring and extra_repr

Merge branch 'main' into add_lfw

a5146eb

resolved py lint errors

df96b44

Merge branch 'add_lfw' of github.com:ABD-01/vision into add_lfw

e93cd21

Merge branch 'main' into add_lfw

baf5556

Added checksums for annotation files

4feed66

Minor changes in test

e9cb48e

pmeier reviewed Sep 10, 2021

View reviewed changes

Updated docstrings, defaults and minor changes in test

cc475cc

pmeier reviewed Sep 10, 2021

View reviewed changes

torchvision/datasets/lfw.py Show resolved Hide resolved

torchvision/datasets/lfw.py Show resolved Hide resolved

Removed 'os.path.exists' check

1466334

pmeier approved these changes Sep 10, 2021

View reviewed changes

Merge branch 'main' into add_lfw

f5c41c4

ABD-01 and others added 3 commits September 10, 2021 21:10

Merge branch 'main' into add_lfw

69e8f2f

Merge branch 'main' into add_lfw

9860672

Merge branch 'main' into add_lfw

3e17463

fmassa approved these changes Sep 14, 2021

View reviewed changes

Merge branch 'main' into add_lfw

4c4b826

fmassa merged commit d85aa6d into pytorch:main Sep 14, 2021

ABD-01 mentioned this pull request Mar 27, 2022

[GSoC] 223 Adding Face-Verification using OpenVino Model openvinotoolkit/openvino_notebooks#509

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added LFW Dataset #4255

Added LFW Dataset #4255

ABD-01 commented Aug 5, 2021

facebook-github-bot commented Aug 5, 2021

facebook-github-bot commented Aug 5, 2021

pmeier left a comment

ABD-01 commented Aug 23, 2021

pmeier left a comment

ABD-01 commented Aug 28, 2021

ABD-01 commented Aug 29, 2021

pmeier left a comment

ABD-01 commented Aug 30, 2021

pmeier commented Aug 30, 2021

pmeier left a comment

ABD-01 commented Aug 30, 2021

pmeier commented Aug 30, 2021

ABD-01 commented Aug 30, 2021

ABD-01 commented Sep 4, 2021

pmeier left a comment

ABD-01 commented Sep 9, 2021

pmeier left a comment

pmeier left a comment

ABD-01 commented Sep 10, 2021

fmassa left a comment

Added LFW Dataset #4255

Added LFW Dataset #4255

Conversation

ABD-01 commented Aug 5, 2021

facebook-github-bot commented Aug 5, 2021

Action Required

Process

facebook-github-bot commented Aug 5, 2021

pmeier left a comment

Choose a reason for hiding this comment

ABD-01 commented Aug 23, 2021

pmeier left a comment

Choose a reason for hiding this comment

ABD-01 commented Aug 28, 2021

ABD-01 commented Aug 29, 2021

pmeier left a comment

Choose a reason for hiding this comment

ABD-01 commented Aug 30, 2021

pmeier commented Aug 30, 2021

pmeier left a comment

Choose a reason for hiding this comment

ABD-01 commented Aug 30, 2021

pmeier commented Aug 30, 2021

ABD-01 commented Aug 30, 2021

ABD-01 commented Sep 4, 2021

pmeier left a comment

Choose a reason for hiding this comment

ABD-01 commented Sep 9, 2021

pmeier left a comment

Choose a reason for hiding this comment

pmeier left a comment

Choose a reason for hiding this comment

ABD-01 commented Sep 10, 2021

fmassa left a comment

Choose a reason for hiding this comment