iBUG_DeepInsight #49
Comments
Right. |
In that case, I don't think it's fair to publish those results on the public results page of the MegaFace challenge. As far as I know, it's not allowed to modify the evaluation set when reporting results, as this would obviously make it impossible to compare the accuracy of the different algorithms. I suggest that you report the results obtained using the original dataset instead. |
I cannot agree with you. |
It's worth mentioning that, to be fair, you should publish this cleaned data so that other researchers can validate it. |
Don't get me wrong, I think it's great that you made everything clear and open source, and as a fellow researcher I thank you for that! As you say, it's very easy to cheat on these types of challenges. However, the assumption is that all the teams work with the same test set and do not tamper with the submitted results. Otherwise, as I said before, it would be impossible to compare the accuracy between algorithms. I agree with you that the best way to test is with a clean dataset, and your paper clearly shows how this affects the accuracy of the algorithm. My problem with your submission is that nowhere on the MegaFace website does it say that you used a cleaned test set. Nor is there a link to this repo or to your paper, where this information is provided. In my opinion, you should let the organisers know about this, so they can decide what to do. I think adding a note under the "Method details" section on the MegaFace website could work as well. |
I think this affects companies and the business of face technologies, which costs us a huge budget; to be fair, I believe you should delete this repository. |
@MartinDione I'm not sure what you are talking about, but that has nothing to do with this discussion |
I'm talking about the MegaFace objective of advancing facial recognition. When you publish code like this you affect other companies; Vocord, for example, invested a lot of money in development to achieve state-of-the-art performance on MegaFace. |
I won't even bother replying to that... Again, that has nothing to do with the issue we are discussing here. |
We will put the noise list on this repo soon. But read the notes carefully at that time. |
@MartinDione It is unbelievable that Vocord would spend a lot of money on a public competition like MegaFace. |
@MartinDione Vocord didn't spend it :) |
It was described in our paper. |
@nttstar I have read your article and now I absolutely disagree that you are playing a fair game. First of all, you have changed the test dataset: you did not just delete something wrong, you replaced the "wrong" FaceScrub images with other images! Moreover, you write that additional artificial features were added to your feature vectors to highlight "bad" images in the distractor dataset. That is the same as using manual annotation! The MegaFace rules forbid both. I would also like to mention that your method of "cleaning" the dataset creates a dataset that is "clean for your algorithm". I am sure that the community will find mistakes in your "error" list when you publish it. P.S. Have you considered that the MegaFace team has the true correspondence list? If they recompute the results with it, your team will take last place, because images of different people will give you a very high FAR (this concerns your image replacing). |
@ivazhu I'm confused about the words 'deleted' and 'changed' in your comment. Anyway, what you said is almost the same as d4nst's points and I don't want to restate my position. The MegaFace team checked our list and our solution; otherwise the result would not be on the leaderboard. |
@nttstar From your article: "During testing, we change the noisy face to another right face" and "During testing, we add one additional feature dimension to distinguish these noisy faces". In your article you declare that you are not using the provided dataset and that you are augmenting features with manual labels. It's obvious to everyone that you broke the MegaFace rules. |
@ivazhu How can you achieve 91% without removing these noises? It's beyond my imagination. |
@nttstar In your text you write, "We manually clean the FaceScrub dataset and finally ...", and in your noise lists megaface_noises.txt has 719 noisy face images and the Facescrub list has 605 noisy face images. Your code for removing noise from the facescrub set:
|
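The noise-removal code referenced above is not reproduced in this thread. As a rough sketch only (the file name, format, and helper names below are assumptions, not the repo's actual implementation), applying a published noise list to a probe set usually looks something like this:

```python
import os

def load_noise_list(path):
    """Read one flagged image name per line, skipping blanks and '#' comments."""
    with open(path, "r") as f:
        return {line.strip() for line in f
                if line.strip() and not line.lstrip().startswith("#")}

def filter_probes(image_paths, noise_list_path):
    """Keep only probe images whose basenames are not in the noise list."""
    noises = load_noise_list(noise_list_path)
    return [p for p in image_paths if os.path.basename(p) not in noises]

# Example (paths are hypothetical):
# clean = filter_probes(all_facescrub_crops, "facescrub_noises.txt")
```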
@nttstar First of all, WE DIDN'T CHANGE THE DATASET as you did. There are some secrets :) For instance, think about what to do if you see more than one face in an "error" distractor image. And also, as I promised, take a look at Alley_Mills_52029, Lindsay_Hartley_33188, Michael_Landes_43643, ... These are not ERRORS. These are errors of your algorithm. In your "work" you simply deleted every sample your algorithm did not work correctly on. Any more questions? |
@ivazhu , what do you mean by "more than one face on an 'error' distractor image"? In verification, it is pair matching; how can there be more than one face? Also, what do you mean by "one face on an 'error' distractor image"? |
There can be more than one face in an image.
|
Do you mean that there can be more than one face in the image, whether it is in the facescrub subset or the distractor set? So you do not use their provided json as a landmark reference in your case. |
We didn't do anything with the Facescrub dataset, because there is no FAIR method to correct this type of error.
As for the distractors - yes, there are some samples with more than one face: one face from Facescrub and another one.
About the megaface json - read the megaface docs - you should use the megaface json only in the case where you can't detect a face.
|
I see. I think this can be a discussion. Originally, I thought the json files are there to tell participants where the face is in an image, so that the algorithm computes the similarity score of those faces (not images). Perhaps some face locations in the json files are incorrect, but this is the ground truth of this challenge, and therefore we should base our scores on these errors. Anyway, your trick may also not be good, because you only apply it to the megaface dataset, which means that any image with more than one face can be treated as a distractor. In other words, you already know the side information that one of the images in your pair is a distractor, so you can do whatever it takes to get a good score for the mismatched pair. I think this also violates the verification protocol, as the algorithm should produce a score for a testing pair without having any side information. |
@ivazhu Choosing one face from among multiple faces in an image according to your own knowledge is also a data trick. It is the same as data noise cleaning. |
@chichan01 Adding random noise to the centre vectors avoids identical feature vectors. The result will not change if no noise is applied. |
@ivazhu I can get 100% on the Megaface challenge by using your approach - detect dozens of faces in each megaface distractor image and choose the one with the lowest similarity to the facescrub dataset, even if sometimes it's not a real face. You can't use knowledge of the facescrub dataset when you're doing something with megaface. I'm very surprised that you and Vocord think this is a normal procedure. |
@nttstar I wonder how images from facescrub would be compared then!
P.S. It is very funny to hear about 100% from the man who added a dimension with labels to his features. |
I think this whole discussion proves my point. Neither of your teams (Deepinsight and Vocord) has strictly followed the MegaFace protocol, so it is pointless to compare the performance of your algorithms with that of the rest of the participants. |
Daniel, please show me where I didn't follow the protocol.
|
@nttstar and @chichan01 have already explained this, but I'll try to make it more clear... As I understand, this is what you do: in the FaceScrub set, you always crop the correct face (using the provided landmarks or bounding box as a reference). In the MegaFace distractor set, you are cropping all the detected faces, comparing them against the probe face and selecting the lowest score as the "valid" score. The problem with your approach is that you are using your knowledge about the origin of the image (probe set or distractor set) to make a decision. You know that a probe and a distractor image shouldn't match, so you just take the lowest score. As others have pointed out, you could use a poor face detector that doesn't even detect faces and this approach would give you a very high accuracy. If you really wanted to take the lowest score, you should do it all the time, not just for distractor images, i.e. when you add the matching face from the probe set to the distractor set, you should also compare against all the detected faces and take the lowest score. If you do that, your performance will probably be much worse. Lastly, just think about a real identification system in which you don't know anything about the origin of the faces. I'm sure that you would agree with me that always selecting the lowest score from all the detected faces would be a very poor design. Please let me know if my assumptions about your approach are wrong. |
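To make the asymmetry described above concrete, here is a minimal, hypothetical sketch (the helper names and cosine scoring are assumptions, not code from either team); the point is only that the `is_distractor` flag is exactly the side information a verification protocol is not supposed to use:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_asymmetric(probe_feat, detected_feats, annotated_feat, is_distractor):
    """The criticised protocol: a known distractor image is scored by the
    minimum similarity over all detected faces, while a probe-set image is
    scored only against the face given by the provided annotation."""
    if is_distractor:
        return min(cosine(probe_feat, f) for f in detected_feats)
    return cosine(probe_feat, annotated_feat)

def score_symmetric(probe_feat, annotated_feat):
    """Origin-blind scoring: every image, probe or distractor, is scored the
    same way, from the annotated face only."""
    return cosine(probe_feat, annotated_feat)
```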
Not quite. First of all, I tuned the detector settings so that there are no failed detections on the FaceScrub dataset. That lets me hope that there are no such detections on the Megaface dataset either. To tell the truth, I checked it manually - there are no failed detections on Megaface with those settings. So I just choose the real distractor from the two real faces.
There is another, more important problem - there are images without faces in the Megaface dataset. Nevertheless, we had to compute "features" from the json coordinates. These "features" are something completely different from a face feature. I'd like to mention that the better the algorithm is trained, the worse its results will be on such samples.
P.S. Daniel, we are not talking about a real identification system here. That is a very different problem :)
|
@nttstar I have just found that you also released your result on the FGNet megaface challenge. I have some questions regarding that result.
|
@ivazhu My point, which others (@d4nst and @nttstar) agree with, is that you only treat the distractor set specially, and this is not the case in a real scenario, where we don't have any side information about the image pair. On the other hand, your work will be much more difficult to follow and reproduce, as you do not release things. Luckily, you popped up here, so I understand a bit of your work. Most importantly, your proposed trick violates the fundamental principle of biometric verification and identification. |
It is common knowledge in the face recognition community that the absolute performance on MegaFace is meaningless. Lots of cheating tricks can be applied to achieve very high scores, and MegaFace doesn't have a mechanism to prevent them. What we can trust are the relative scores only, which means one can only compare with oneself on MegaFace. As in this issue, models evaluated on the cleaned list should only be compared with ones evaluated on the same list. The authors have these experiments in their paper. That is already enough, not to mention that they released their code and will release the cleaned list. The only problem is that the official MegaFace organizers should have made two leaderboards once they became aware of the existence of the "cleaned list". However, they were not willing to do so and chose to put everything on the same leaderboard. This is the problem. The authors of InsightFace did nothing wrong. |
I appreciate that Vocord let us know what they have done to the dataset |
@ivazhu the point of my comment was not about the failed detections. It was about treating probe and distractor images in a different way when in a real setting you wouldn't have this information. |
@d4nst |
@nttstar Yes, I have realised that. However, that doesn't make things any better. The MegaFace leaderboard seems pretty meaningless to me now. We need a reliable and standard benchmark similar to ImageNet in the face recognition community |
Daniel, see the megaface forum - I have already suggested that. Moreover, we suggested to the megaface team two years ago that they clean up their dataset. A guy from Google also wrote that he informed them about the errors at the beginning of the challenge.
I suggest discussing a new challenge at some big conference. There are a lot of problems in creating a competition that is fair for everyone.
|
Another big issue I want to raise here is: cleaning the test set (the megaface distractor set) is a MUST-DO step when testing your algorithms/models; otherwise the results will not be solid even if you only compare relative scores. Take an example from our paper: SphereFace (m=4, lambda=5) achieves 82.95% and 97.43% on the 'before-cleaning' and 'after-cleaning' protocols respectively, while ArcFace (m=0.4) gets 82.29% and 98.10% (ArcFace (m=0.5) gets 98.36% for reference). Looking at the scores before refinement, we might judge that SphereFace is better than ArcFace under the m=0.4 setting, but actually it is not. |
Jia, it may be due to your cleaning process and its errors. Besides the errors in your error list, there are some "bad" (non-corresponding) images that you didn't catch. For example:
Peggy_McCay_48889
Adrienne_Barbeau_4246
Alley_Mills_52003
So I can say that you haven't proved the statement above.
|
FaceScrub errors are all fixed in my experiments, so this does not affect the results. Also, we got only about a 1% accuracy improvement after fixing the FaceScrub identity errors, much less than from fixing megaface's. It is not worth mentioning here. |
Peggy_McCay_48889
Adrienne_Barbeau_4246
Alley_Mills_52003
These are FaceScrub errors. Did you fix them? If yes, why doesn't your error list contain them?
|
You can open a standalone issue to describe it. The noise list must be refined over and over, and not just by our team. And these three items will not lead to much difference (<0.1% I believe, since the 600 facescrub noises increase the accuracy by only about 1%). I don't think it breaks my argument above. |
1. These three names are only examples of your errors.
2. There are only 3530 images from FaceScrub in the Megaface protocol. And I think (I know the real number) the intersection with the 600 "facescrub noises" is about 20-30 images. So even three noises (and I repeat that there are more such cases) are very important.
3. Jia, you deleted several non-noise images in your experiments. Given that, your "big issue" can be explained by these mistakenly deleted images.
|
@ivazhu I'm wasting my time replying to you. From +0.66% to -0.67% - so you mean that your three items can affect the accuracy by more than 1% in one direction? Please stop. |
Jia, I thought you knew this subject, but it seems you don't. Think about how many positive pairs (pairs of images of the same person) you have. Then look at how many images of Peggy_McCay there are in the Facescrub subset from Megaface. Even if there is only one error you failed to detect, then there are N positive pairs which are not actually positive. But you managed to miss more than one erroneous Peggy_McCay image!
|
I will give you the results soon. |
I can attest to @nttstar's statement that denoising the facescrub probe set only changes the final score by about 1%. Out of the 3530 probes, 24 show up on the noise list from @nttstar, and 24 is a small portion of 3530. More importantly, if you check those pictures, they are mostly still celebrities and should not match any distractors. In other words, you end up with less score increase than 24/3530 might suggest. The reason the 707 mislabeled distractors have such a huge influence on the final score is that you are not comparing them with 1M distractors, you are comparing them with 3530 probes. If any of the mislabeled distractors gets a higher similarity than the probe's true match, your rank-1 score is ruined. In that case, rank 5 or rank 10 might be a better metric, because it allows for some noise. But megaface's devkit only prints out the rank-1 score. Oh well. |
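A toy simulation of the point above (synthetic scores, not MegaFace data; all numbers are illustrative assumptions): with a rank-1 metric, a single mislabeled distractor that happens to score above the probe's true match flips the whole trial.

```python
import numpy as np

rng = np.random.default_rng(0)

n_distractors = 1_000_000
mate_score = 0.62                                           # probe vs. its true gallery mate
distractor_scores = rng.normal(0.10, 0.08, n_distractors)   # unrelated faces, low similarity

def rank_of_mate(mate, others):
    """1 + number of distractor scores at least as high as the true match."""
    return 1 + int(np.sum(others >= mate))

print(rank_of_mate(mate_score, distractor_scores))   # 1: rank-1 hit
distractor_scores[0] = 0.80                          # one mislabeled distractor showing the same person
print(rank_of_mate(mate_score, distractor_scores))   # 2: the rank-1 trial now fails
```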
It's a very good point. Having a public noise list and letting everyone examine it and follow suit is the right way to go. If someone is organizing a new contest, I think the organizer should have the following rules:
- Ideally each pic contains 1 face. Remove wrongly labeled ones.
- For pics containing more than 1 face, make sure the provided face bbox corresponds to the correct face, and urge participants to use that bbox to generate landmarks, align, and generate reps.
- Urge participants to use the exact same procedure for images from the probe and distractor sets. |
What do you guys think of NIST's FRVT Ongoing (https://www.nist.gov/programs-projects/face-recognition-vendor-test-frvt-ongoing) as a benchmark? Many commercial vendors use it for quite a fair comparison of their face recognition algorithms. FRVT requires you to use face detection and does not provide landmarks, but I guess MTCNN does a decent job of solving this issue. |
Hi, Justas
The problem is that FRVT requires your algorithm without any kind of protection. You give your algorithm to the American government and they can use it wherever they want.
And another point for you: the recognition error of the best algorithms on FRVT is several times smaller than the MTCNN detection error.
Ivan
|
With the non-cleaned megaface and facescrub, all results above ~85-86% involve cheating in some way (wilful or unintended). With the cleaned megaface, evaluations are unfair to other methods conducted on the previous non-cleaned protocol, such as SphereFace, TencentFace, and papers before March. Angular margins are very tricky; you can always mess up other people's methods and pretend your method performs better. If possible, do not introduce more parameters, because parameters always fit the training datasets: with small datasets, no strong constraints are used; with larger datasets, maybe stronger constraints - who knows? Report all your results trained on CASIA, MS-Celeb-1M, and VGGFace2, because most previous methods are trained on CASIA. There is no fairness in the face recognition area. Probably a blind test and protocol would be ok. The Megaface protocol sucks. Let us try something else. |
I have seen that the current top algorithm in the MegaFace challenge is iBug_DeepInsight, with an accuracy that corresponds to your latest update: "2018.02.13: We achieved state-of-the-art performance on MegaFace-Challenge-1, at 98.06%".
After reading your paper and the README in this repo, it seems to me that this accuracy is achieved using the cleaned/refined MegaFace dataset. Is this correct?