the overlapped identities between LFW and ms1m #24

Closed
zhenglaizhang opened this issue Jan 30, 2018 · 14 comments

Comments

@zhenglaizhang

Awesome work!
As far as I know, there are some overlapping identities between LFW and ms1m. Has the clean list removed them? The overlap may affect the reported performance on LFW.

@azat-d

azat-d commented Jan 30, 2018

Also, there are some overlapping identities between facescrub and ms1m. I downloaded the correspondence between MIDs and real names from Freebase. Please check the attachment:
mid_to_name.txt.zip
UPD: Aaron Eckhart has the identifier m.03t4cz. This person is present in both the test and the training sets. Obviously, there are other such persons.
UPD2: m.04wp3s: Sam Rockwell, m.014zfs: Bill Cosby, m.02h3tp: Patrick Swayze, etc. All of these identities are in both the training and test sets (I've just checked manually; I believe the intersection is more than 50%).

@nttstar
Collaborator

nttstar commented Jan 30, 2018

We're running that experiment and the results will be available in our paper soon; I think performance will be slightly worse (<0.1).
We have already removed 500+ identities from ms1m by checking the similarity between facescrub and ms1m. Please see src/data/dataset_merge.py if you want to know how we remove overlaps.
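
The idea of removing overlaps by similarity can be sketched roughly as below. This is not the code in src/data/dataset_merge.py: the function names, the use of one mean embedding per identity, and the 0.5 threshold are all assumptions chosen only for illustration.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def overlapping_identities(train, test, threshold=0.5):
    """Return training identity IDs whose centre is close to any test centre.

    `train` and `test` map identity ID -> list of embedding vectors.
    The threshold is a hypothetical value, not one taken from the repository.
    """
    test_centres = [mean_vector(v) for v in test.values()]
    flagged = set()
    for ident, vectors in train.items():
        centre = mean_vector(vectors)
        if any(cosine(centre, t) >= threshold for t in test_centres):
            flagged.add(ident)
    return flagged


# Toy 2-D embeddings; real face embeddings have far more dimensions.
train = {
    "train_a": [[1.0, 0.0], [0.9, 0.1]],
    "train_b": [[0.0, 1.0], [0.1, 0.9]],
}
test = {"test_x": [[0.95, 0.05]]}

to_remove = overlapping_identities(train, test)
cleaned = {k: v for k, v in train.items() if k not in to_remove}
print(sorted(to_remove), sorted(cleaned))
```

Unlike name matching, a similarity check does not depend on identity labels, but its result depends on the embedding model and the threshold chosen.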

@azat-d

azat-d commented Jan 30, 2018

I just wrote a script that checks for matches between test persons (the subset of facescrub used in the MegaFace challenge) and persons from the training set (your cleaned ms1m list). 54 of the 80 persons are in both the training and test sets:
Stana_Katic m.0fd6sd
Farrah_Fawcett m.01j851
Sam_Rockwell m.04wp3s
Alec_Baldwin m.018ygt
Christopher_Reeve m.0jrny
James_Remar m.05mlqj
Brendan_Fraser m.0227tr
Brianna_Brown m.0gdvdh
Andrea_Bowen m.05dxl5
Tempestt_Bledsoe m.014yqb
Paul_Bettany m.01chc7
Robert_Redford m.0gs1_
Mark_Wahlberg m.0gy6z9
Sarah_Hyland m.0523pz4
Alley_Mills m.0d_3hq
Kit_Harington m.09v4hnq
Victoria_Justice m.07w71b
Robert_Duvall m.015c4g
Edie_Falco m.01dy7j
Peggy_McCay m.05j0x1
Jeremy_Irons m.016ywr
Rebecca_Budig m.03jtgb
Brad_Garrett m.01rcmg
Bill_Cosby m.014zfs
Christel_Khalil m.0719hb
Lindsay_Hartley m.04w9ky
Joanna_Kerns m.0403xb
Emile_Hirsch m.05mkhs
Christine_Lakin m.06wr68
Marilu_Henner m.02pzx7
James_Marsden m.042ly5
Justin_Timberlake m.0j1yf
Adam_Brody m.0214df
Patrick_Swayze m.02h3tp
John_Malkovich m.017r13
Melina_Kanakaredes m.02pbhg
Nadia_Bjorlin m.04vpr3
Ryan_Phillippe m.01ksr1
Fran_Drescher m.01s3kv
Norman_Reedus m.0bs6hr
Robert_Knepper m.07v7p6
Didi_Conn m.04tvm2
Bobbie_Eakes m.03s_t9
Heath_Ledger m.0237fw
Summer_Glau m.039g0_
Emily_Deschanel m.03vd_l
Orlando_Bloom m.09wj5
Daniel_Day-Lewis m.016yvw
Shia_LaBeouf m.04w391
Kimberlin_Brown m.03ff8f
Adrienne_Barbeau m.01z7nj
Dean_Cain m.02qjj7
Erin_Cummings m.063z0nr
Joaquin_Phoenix m.018db8
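
A check like the one described above can be sketched as follows. This is not the actual script: the function names `normalize` and `find_overlap`, the name normalization, and the toy data (including the made-up `Hypothetical_Person`) are assumptions for illustration. It assumes a MID-to-name mapping and the set of MIDs present in the training list.

```python
def normalize(name):
    """Lowercase and unify separators so 'Sam_Rockwell' matches 'Sam Rockwell'."""
    return " ".join(name.replace("_", " ").lower().split())


def find_overlap(test_names, train_mids, mid_to_name):
    """Return {test_name: mid} for test identities whose name maps to a training MID."""
    # Invert the mapping, keeping only MIDs that are actually in the training list.
    name_to_mid = {}
    for mid in train_mids:
        name = mid_to_name.get(mid)
        if name is not None:
            name_to_mid[normalize(name)] = mid

    overlap = {}
    for test_name in test_names:
        mid = name_to_mid.get(normalize(test_name))
        if mid is not None:
            overlap[test_name] = mid
    return overlap


# Toy data: three MIDs from the list above plus one made-up test name.
mid_to_name = {
    "m.04wp3s": "Sam Rockwell",
    "m.014zfs": "Bill Cosby",
    "m.02h3tp": "Patrick Swayze",
}
train_mids = {"m.04wp3s", "m.014zfs"}  # pretend only these two are in the training list
test_names = ["Sam_Rockwell", "Bill_Cosby", "Patrick_Swayze", "Hypothetical_Person"]

overlap = find_overlap(test_names, train_mids, mid_to_name)
ratio = len(overlap) / len(test_names)
print(f"{len(overlap)}/{len(test_names)} test identities overlap ({ratio:.1%})")
```

Exact name matching only gives a lower bound on the overlap, since identities listed under a different spelling or alias are missed. With the counts reported above, 54/80 corresponds to 67.5%.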

@nttstar
Collaborator

nttstar commented Jan 30, 2018

@azat-d I think it is also very difficult to find ALL overlaps by name matching.

@azat-d

azat-d commented Jan 30, 2018

Agreed. But according to my test there is at least a 67.5% overlap (54 of the 80 test identities). I don't trust any results that are based on celebrity datasets. The most reliable test is the NIST FRVT, which is free for all researchers.

@nttstar
Collaborator

nttstar commented Jan 30, 2018

@azat-d I have removed 500+ identities from MS1M by comparing it with the facescrub dataset, in order to test on MegaFace. For reference, facescrub has only 530 identities in total. I believe our result is quite reliable.

@azat-d

azat-d commented Jan 30, 2018

The MegaFace test uses only 80 identities from facescrub, and I checked YOUR training list against those identities.

@azat-d

azat-d commented Jan 30, 2018

And I found that 54/80 identities are in both the test set and your training set.

@azat-d

azat-d commented Jan 30, 2018

I'm talking about this training set: https://pan.baidu.com/s/1eTn6O62

@azat-d

azat-d commented Jan 30, 2018

Do you mean that there was additional cleaning of this list?

@nttstar
Collaborator

nttstar commented Jan 30, 2018

The 500+ identities were removed in my binary packed dataset, not in this clean list. You can check it in our paper; there is about a 0.3% performance drop (98.3% -> 98.0%).
You need to generate features for all 530 identities if you want to upload the result; the 80 identities are only required by set-1.

@azat-d

azat-d commented Jan 30, 2018

Ok, thank you!

@zhenglaizhang
Author

Great to hear the results on removing the overlapping identities. Thank you, guys. I will also take a look at this and may post an update here if I get any new results.

@zhenglaizhang
Author

Closing, as this has been well discussed here.
