Add VGGface2 dataset #1193

dakshjotwani · 2019-08-02T11:05:44Z

PR for adding VGGFace2 dataset.

Questions:

I'm not sure how to add test cases for this, since I don't know how to produce fake data to test it. I looked for CelebA test cases to use as a reference, but didn't find any for that dataset either.
Currently I am using a dictionary to store the bounding box and landmark data. Would it be better or more efficient to store it as a list instead?

codecov-io · 2019-08-05T06:10:46Z

Codecov Report

Merging #1193 into master will decrease coverage by 0.75%.
The diff coverage is 19.35%.

@@            Coverage Diff             @@
##           master    #1193      +/-   ##
==========================================
- Coverage   65.74%   64.99%   -0.76%     
==========================================
  Files          79       75       -4     
  Lines        5827     5842      +15     
  Branches      887      898      +11     
==========================================
- Hits         3831     3797      -34     
- Misses       1727     1778      +51     
+ Partials      269      267       -2

Impacted Files	Coverage Δ
torchvision/datasets/vggface2.py	`18.03% <18.03%> (ø)`
torchvision/datasets/__init__.py	`100.00% <100.00%> (ø)`
torchvision/datasets/hmdb51.py	`27.65% <0.00%> (-3.30%)`	⬇️
torchvision/datasets/ucf101.py	`25.00% <0.00%> (-3.21%)`	⬇️
torchvision/transforms/functional.py	`70.23% <0.00%> (-1.16%)`	⬇️
torchvision/transforms/transforms.py	`80.35% <0.00%> (-0.59%)`	⬇️
torchvision/io/video.py	`72.00% <0.00%> (ø)`
torchvision/ops/boxes.py	`94.73% <0.00%> (ø)`
torchvision/models/video/__init__.py	`100.00% <0.00%> (ø)`
torchvision/models/video/r3d.py
... and 7 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 94c9417...6d907ba.

fmassa

Thanks for the PR!

About the tests, can you create some fake images following imagenet, and also some random csv files?

Check test/test_datasets.py for more details.
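As an illustration of the kind of fake data meant here, the following is a minimal sketch using only the standard library. The directory layout (`<split>/<identity>/<image>`), the CSV column names, and the helper `make_fake_vggface2` are assumptions made for this example; they are not taken from the real VGGFace2 files or from an existing torchvision test helper. Tiny PPM files stand in for real JPEGs.

```python
import csv
import os
import random
import tempfile


def make_fake_vggface2(root, num_ids=2, imgs_per_id=3, split="train"):
    """Create a tiny fake dataset tree plus a random bounding-box CSV.

    Hypothetical helper: the layout and column names are assumptions.
    """
    rows = []
    for i in range(num_ids):
        identity = "n{:06d}".format(i)
        id_dir = os.path.join(root, split, identity)
        os.makedirs(id_dir)
        for j in range(imgs_per_id):
            name = "{:04d}_01".format(j)
            # A 2x2 binary PPM (P6) image: header followed by 12 random RGB bytes.
            pixels = bytes(random.randrange(256) for _ in range(12))
            with open(os.path.join(id_dir, name + ".ppm"), "wb") as f:
                f.write(b"P6\n2 2\n255\n" + pixels)
            # One random box (x, y, w, h) per image.
            box = [random.randint(0, 100) for _ in range(4)]
            rows.append([identity + "/" + name] + box)

    bbox_csv = os.path.join(root, "bb_{}.csv".format(split))
    with open(bbox_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["NAME_ID", "X", "Y", "W", "H"])  # assumed header
        writer.writerows(rows)
    return bbox_csv


with tempfile.TemporaryDirectory() as tmp:
    csv_path = make_fake_vggface2(tmp)
    with open(csv_path, newline="") as f:
        records = list(csv.DictReader(f))
    num_images = sum(len(files) for _, _, files in os.walk(os.path.join(tmp, "train")))
```

A dataset test could then point the dataset at the temporary root and check that its length matches `num_images` and that each returned box matches the corresponding CSV row.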

About the structure of the dataset, I'm wondering whether we should always return the full image plus a dict with the id / box / landmarks.

Even though some datasets already provide options in the constructor to define what the output should look like, I think it might be better to always return the same thing consistently, and let the transform / user code pick out the data they want.

Thoughts?

dakshjotwani · 2019-08-06T05:16:26Z

> About the structure of the dataset, I'm thinking if we should always return the full image plus a dict with the id / box / landmarks.

This might not be a good idea, since it will require the user to download all the optional metadata files to use the dataset. If all the metadata came from one file, this would have been good. Also, having to provide additional data only to remove it using a target_transform feels wrong. What do you think?

fmassa · 2019-08-07T00:11:47Z

> since it will require the user to download all the optional metadata files to use the dataset

But the metadata is much smaller than the images in size, right? So this should not be a problem I think. And we could provide the download function in the dataset itself.

> Also, having to provide additional data only to remove it using a target_transform feels wrong. What do you think?

I'm not 100% sure. Having the type of output of the dataset depend on a constructor argument doesn't sound like the best thing to me.

I'll let others give their opinions here.

cc @soumith @cpuhrsch @zhangguanheng66 what do you think?

dakshjotwani · 2019-08-28T00:17:26Z

@fmassa what should we do?

fmassa · 2019-09-30T15:11:35Z

I'm very sorry for the delay in replying. Your message was sent just after I came back from holidays, and I missed it in a pile of notifications.

I still think that returning a dict with all the metadata would make more sense, if it's what the dataset provides by default. As I mentioned before, having the return values depend on a constructor argument gives the datasets a lot less structure, and is done only for convenience with the current transforms. Plus, we would probably also want to handle the empty case, as discussed in #1351.

I'll ping @cpuhrsch, @zhangguanheng66 and @vincentqb again for their thoughts on this.

vincentqb · 2019-09-30T15:39:36Z

> I still think that returning a dict with all the metadata would make more sense, if it's what the dataset provides by default.

I second that, with maybe a minor tweak.

Do we have an example of a dataset returning more than one tensor? I'd say a dict is a good idea, and the returned tensors could be part of it to address such a case. This, however, pushes the organization of a dataset into how we structure the dict itself. :) Thoughts?

> it will require the user to download all the optional metadata files to use the dataset. If all the metadata came from one file, this would have been good.

Can you clarify this @dakshjotwani? Could the dict only contain the fields that were downloaded?

dakshjotwani · 2019-11-09T22:02:24Z

Sorry for the late reply, I just saw these messages buried in my email!

> Can you clarify this @dakshjotwani? Could the dict only contain the fields that were downloaded?

Yes, we can. I think that's a good way to avoid returning extra data that would later end up being removed with a target_transform.

@fmassa I agree that returning a dict is a good idea as long as we construct the dict based on the metadata/fields files provided as args to the constructor. What do you think?
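A minimal sketch of this idea, in plain Python: the target dict always carries the identity, and gains a `bbox` or `landmarks` entry only when the corresponding file was passed to the constructor. The class name `FaceTargets`, the argument names, the dict keys, and the CSV layout (name in the first column, integer values after it) are all assumptions for illustration, not the PR's actual code.

```python
import csv
import io


def load_metadata(fileobj):
    """Map image name -> list of ints from a CSV whose first column is the name."""
    reader = csv.reader(fileobj)
    next(reader)  # skip the header row
    return {row[0]: [int(v) for v in row[1:]] for row in reader}


class FaceTargets:
    """Hypothetical stand-in for the dataset's target-building logic."""

    def __init__(self, samples, bbox_file=None, landmark_file=None):
        self.samples = samples  # list of (image_name, identity)
        # Metadata is loaded only for the files that were actually provided.
        self.bbox = load_metadata(bbox_file) if bbox_file is not None else None
        self.landmarks = load_metadata(landmark_file) if landmark_file is not None else None

    def target(self, index):
        name, identity = self.samples[index]
        target = {"id": identity}
        if self.bbox is not None:
            target["bbox"] = self.bbox[name]
        if self.landmarks is not None:
            target["landmarks"] = self.landmarks[name]
        return target


bbox_csv = io.StringIO("NAME_ID,X,Y,W,H\nn000001/0001_01,10,20,30,40\n")
samples = [("n000001/0001_01", "n000001")]

only_id = FaceTargets(samples).target(0)
with_bbox = FaceTargets(samples, bbox_file=bbox_csv).target(0)
```

With this shape the return type is always `(image, dict)`; only the set of keys varies with the files supplied, so users without the optional metadata are not forced to download it.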

jgbradley1 · 2020-10-23T19:39:24Z

Is this PR still being worked on?

dakshjotwani · 2020-10-23T19:47:44Z

I can pick it up again. @fmassa @jgbradley1 should I close this PR and open a new one? This one's about a year old and a lot of things have changed since then.

jgbradley1 · 2020-10-23T20:12:11Z

Since only 2 files are impacted by this change, it seems simpler to just update the branch from master. There would probably be only one conflict to resolve (i.e. no need to start a new PR).

I'm interested in seeing this dataset get added.

A couple suggestions I have:

~~consider adding a download option~~
since the dataset is not automatically accessible (it requires a login before downloading), consider adding more information to the docstring about the folder structure the data files are expected to be in. I just submitted a PR for another dataset, and after reading through several of the current Dataset classes, I realized how unclear it may be to new users with a local copy how to hook up torchvision to their local folders.
consider adding a split argument that must be set to either train/test. It's not immediately clear to me what partition is being used.
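The split suggestion could look roughly like the sketch below. The helper name `resolve_split_dir` and the assumption that each split lives in a subfolder of the root named after it are hypothetical; they are not taken from this PR.

```python
import os

VALID_SPLITS = ("train", "test")


def resolve_split_dir(root, split):
    """Return the image directory for `split`, failing early with a clear message."""
    if split not in VALID_SPLITS:
        raise ValueError(
            "Unknown split {!r}; expected one of {}".format(split, VALID_SPLITS)
        )
    # Assumed layout: <root>/train/... and <root>/test/...
    return os.path.join(root, split)
```

Validating the argument in the constructor makes the partition explicit to the caller and gives a natural place in the docstring to describe the expected local folder layout.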

jgbradley1 · 2020-10-27T04:35:10Z

Since this PR has been open for a while, I wrote a quick updated version of what the dataset could look like here. It matches the design of the CelebA Dataset.

yassineAlouini · 2022-05-02T14:56:33Z

It seems that this PR is being continued here: #2910. I guess it can be closed @pmeier?

pmeier · 2022-05-03T06:44:21Z

Given that @dakshjotwani wrote:

> I can pick it up again.

and @jgbradley1 already sent the update in #2910, I'm going to close this PR.

dakshjotwani added 5 commits August 1, 2019 17:02

Add VGGFace2 dataset

d1cc9e9

Add bbox csv support

8cc6ad4

Add landmark csv support

037ef87

Add __len__ and extra_repr methods

3e780e1

Remove tuple unpack for python2

6d907ba

fmassa reviewed Aug 5, 2019

vincentqb mentioned this pull request Oct 8, 2019

Shared Dataset Functionality pytorch/pytorch#24915

Open

jgbradley1 mentioned this pull request Oct 27, 2020

Add vggface2 dataset #2910

Closed

oke-aditya mentioned this pull request Mar 12, 2021

[RFC] New datasets to torchvision #3562

Open

17 tasks

pmeier self-assigned this Apr 8, 2022

pmeier closed this May 3, 2022
