-
Notifications
You must be signed in to change notification settings - Fork 7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added LFW Dataset #4255
Added LFW Dataset #4255
Conversation
Hi @ABD-01! Thank you for your pull request and welcome to our community. Action RequiredIn order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks! |
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @ABD-01 and thanks for the PR. Sorry about the delay as I was on vacation.
I have a few minor comments inline and only reviewed LFW_People
, since most of the comments also apply to LFW_Pairs
. Apart from that I have three more comments:
- The two datasets have a very similar structure. Since they return different things, I think having two separate classes is fine to avoid polymorphism. Still, I wonder if it makes sense to have a common superclass, e.g.
_LFW
that groups the common behavior such as downloads. - We have automated tests for datasets located in test/test_datasets.py. Can you please add tests for these datasets as well? The main challenge is to provide "fake data", i.e. the (extracted) data that would be present after
.download()
is completed. It should resemble the structure of the original dataset, but only provide a few dummy images. Let me know if you need help with that. - Please also add the datasets to the list in
torchvision/datasets/__init__.py
as well to our documentation indocs/source/datasets.rst
. Otherwise they are not discoverable.
* Created a common superclass for people and pairs type datatsets * corrected the .download() method
Hi, I have made new commits as per the comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @ABD-01 and thanks for the continuing effort. I have some minor comments inline. Apart from them, two more things:
- Please also add the new datasets to our documentation as shortly outlined in Added LFW Dataset #4255 (review)
- I have not yet reviewed the
_get_people
and_get_pairs
methods, as they really depend on the data format, which I don't know yet. However, this will be solved by the next point:
The tests are remaining so I'll add them, could you please guide me on how to proceed with that?
Sure! You need to create two test cases, LFWPeopleTestCase
and LFWPairsTestCase
, in tess_datasets.py
that inherit from
Line 180 in 7947fc8
class DatasetTestCase(unittest.TestCase): |
Please read the docstring to get a better understanding what needs to be done.
The absolute minimum that you need to provide, is an implementation of the inject_fake_data
method. It gets passed a tmp_dir
that needs to contain all files in the correct structure that would be present in root
if I would instantiate the dataset with LFWPeople(root, download=True)
. If the content changes depending on other keyword arguments, you can query them from the passed config
dict, i.e. config["train"]
. If you overwrite the class attribute ADDITIONAL_CONFIGS
for example with
ADDITIONAL_CONFIGS = datasets_utils.combinations_grid(
train=(True, False), image_set=("original", "funneled", "deepfunneled"),
)
you will get automate testing for all combinations.
Of course the files provided by inject_fake_data
musn't be the original files, but rather mock ups of them, i.e. "fake data". For example, you will want to create a peopleDevTrain.txt
file, that only contains a few lines. For the accompanying images, you can use convenience utilities such as
Line 666 in 7947fc8
def create_image_file( |
or
Line 702 in 7947fc8
def create_image_folder( |
For the LFW dataset, I propose you add a _LFWTestCase
that handles the injection of fake data so that the actual test cases only need to handle their special stuff.
If all of the above is "too much", we can also do it step by step. Implement something and if you hit a roadblock, push the commit and ping me so I can take a look at it and guide you more precisely.
Hey, sorry for the delay for this update (had some college work). |
I have updated the tests as all cases were failing before. ======================================================================
FAIL: test_feature_types (__main__.LFWPairsTestCase) (idx=0, image_set='deepfunneled', train=True)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/root/vision/test/datasets_utils.py", line 553, in test_feature_types
assert isinstance(feature, expected_feature_type)
AssertionError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now only the feature types test is failing and I can't figure out why is this happening.
The test is not implemented in a way that lets you check the types inside of a container. By providing (PIL.Image.Image, PIL.Image.Image)
as first element of the FEATURE_TYPES
class attribute, the test will check if the first element of a returned sample is either a PIL.Image.Image
or a PIL.Image.Image
. Since it is tuple, the test fails. Weirdly, the introspection of pytest
is "disabled" and this is why you don't get a proper error message (cc @NicolasHug ).
To fix this, you would need to define FEATURE_TYPES = (tuple, int)
. But then we should discuss if this is the way to go. What is the reason you want to return the image pair in a tuple rather than individually, i.e. return img1, img2, target
? We do it like that in our PhotoTour
dataset:
vision/torchvision/datasets/phototour.py
Line 118 in 5ef75d7
return data1, data2, m[2] |
Apart from that, IMO the PR is 95% done. Thanks a lot @ABD-01! I have one more minor comment inline.
Thanks for the Review.
We can do it like this as well. I guess that'll be better way. I was returning a tuple as we needed a pair of images and a label whether they are same or not. Apart form that I have a small doubt, there's a |
Since you want to contribute this dataset, I'm guessing you want to use it for research or something related, right? If you know the field, is the fold variant used in other publications? If not, I'm fine to table this for now and only add it if we get requests for it. If you want to add it, that would also be ok for me.
What format are you talking about? Isn't the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two more nitpicks after I've looked at the tests again. Not blocking, but would improve the code quality.
Yes it's almost same, just that it is divided in 10 parts, whose each part can be viewed as |
Ok, but that means that each fold is similar to a split, right? If yes, instead of Still, let's first answer if we want to go that route in this PR. |
I will try to add that functionality, if works out easily then fine, else we are good to go with the current code. |
* Added functionality for 10-fold validation view * Optimized the code so to replace repeated lines by method in super class * Updated LFWPeople to get classes from all lfw-names.txt rather than just the classes fron trainset
Hey!, I did some updates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @ABD-01, I have a hint about the failing CI as well as one nitpick inline.
Apart from that, there are two things left for you to do at the moment: address #4255 (comment) and #4255 (comment). If that is done and tests are passing, I'll do one final pass, because I haven't reviewed the people / pair extraction yet. Please ping me if the PR is otherwise ready.
* Updated inject_fake_data method to create 10fold fake data * Minor changes in docstring and extra_repr
@pmeier The PR is ready !! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @ABD-01! This really looks good. I have some very minor comments inline as well as one question. Otherwise this is good to go!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, thanks a lot @ABD-01!
Thank you Sir 🙌 Will be happy to contribute further. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Summary: * Added LFW Dataset * Added dataset to list in __init__.py * Updated lfw.py * Created a common superclass for people and pairs type datatsets * corrected the .download() method * Added docstrings and updated datasets.rst * Wrote tests for LFWPeople and LFWPairs * Resolved mypy error: Need type annotation for "data" * Updated inject_fake_data method for LFWPeople * Updated tests for LFW * Updated LFW tests and minor changes in lfw.py * Updated LFW * Added functionality for 10-fold validation view * Optimized the code so to replace repeated lines by method in super class * Updated LFWPeople to get classes from all lfw-names.txt rather than just the classes fron trainset * Updated lfw.py and tests * Updated inject_fake_data method to create 10fold fake data * Minor changes in docstring and extra_repr * resolved py lint errors * Added checksums for annotation files * Minor changes in test * Updated docstrings, defaults and minor changes in test * Removed 'os.path.exists' check Reviewed By: datumbox Differential Revision: D31268043 fbshipit-source-id: 4fba9d652184a176220fe1460fc22144caa81cc2 Co-authored-by: ABD-01 <myac931@gmai.com> Co-authored-by: Philip Meier <github.pmeier@posteo.de> Co-authored-by: Francisco Massa <fvsmassa@gmail.com>
Hi!
This pr is with reference to #3562
I added the dataset class for LFW Labeled Faces in Wild
Please do comment for the required changes, and other things to be done.
Thank You