
iBUG_DeepInsight #49

Closed
d4nst opened this issue Feb 13, 2018 · 70 comments

@d4nst

d4nst commented Feb 13, 2018

I have seen that the current top algorithm in the MegaFace challenge is iBug_DeepInsight, with an accuracy that corresponds to your latest update: "2018.02.13: We achieved state-of-the-art performance on MegaFace-Challenge-1, at 98.06%".

After reading your paper and the README in this repo, it seems to me that this accuracy is achieved using the cleaned/refined MegaFace dataset. Is this correct?

@nttstar
Collaborator

nttstar commented Feb 14, 2018

Right.

nttstar closed this as completed Feb 14, 2018
@d4nst
Author

d4nst commented Feb 14, 2018

In that case, I don't think it's fair to publish those results on the public results page of the MegaFace challenge. As far as I know, it's not allowed to modify the evaluation set when reporting results, as this would obviously make it impossible to compare the accuracy of the different algorithms. I suggest that you report the results obtained using the original dataset instead.

@nttstar
Collaborator

nttstar commented Feb 14, 2018

I cannot agree with you.
First, it is impossible to guarantee that any submitted result is fair unless it comes with a published paper and open-source code. I can think of dozens of ways to cheat on it.
Second, we followed all the rules of the MegaFace challenge but corrected the errors they made in the distractor images. What we report is the real performance of every face recognition algorithm we experimented with. As shown in our paper, the accuracy can be essentially random if we do not remove these distractor noises, which is what is actually unfair when comparing two models/algorithms.
Last, we made everything clear and open source, compared against all published state-of-the-art algorithms, and demonstrated that our approach performs best.

nttstar reopened this Feb 14, 2018
@ghost

ghost commented Feb 14, 2018

It's worth mentioning that, to be fair, you should publish this cleaned data so that other researchers can validate it.

@d4nst
Author

d4nst commented Feb 14, 2018

Don't get me wrong, I think it's great that you made everything clear and open source, and as a fellow researcher I thank you for that!

As you say, it's very easy to cheat in this type of challenge. However, the assumption is that all teams work with the same test set and do not tamper with the submitted results. Otherwise, as I said before, it would be impossible to compare the accuracy of different algorithms. I agree with you that the best way to test is with a clean dataset, and your paper clearly shows how this affects the accuracy of the algorithm. My problem with your submission is that nowhere on the MegaFace website does it say that you used a cleaned test set, nor is there a link to this repo or to your paper, where this information is provided.

In my opinion, you should let the organisers know about this so they can decide what to do. I think adding a note under the "Method details" section on the MegaFace website could work as well.

@ghost

ghost commented Feb 14, 2018

I think this affects companies and the face technology business, which costs us a huge budget; to be fair, I believe you should delete this repository.

@d4nst
Author

d4nst commented Feb 14, 2018

@MartinDione I'm not sure what you are talking about, but it has nothing to do with this discussion.

@ghost

ghost commented Feb 14, 2018

I'm talking about MegaFace's objective of advancing facial recognition. When you publish code like this, you affect other companies, like Vocord, who invested a lot of money in development to achieve state-of-the-art performance on MegaFace.

@d4nst
Author

d4nst commented Feb 14, 2018

I won't even bother replying to that... Again, that has nothing to do with the issue we are discussing here.

@nttstar
Collaborator

nttstar commented Feb 15, 2018

We will put the noise list in this repo soon, but please read the accompanying notes carefully when it is published.

@nttstar
Collaborator

nttstar commented Feb 15, 2018

@MartinDione It is unbelievable that Vocord would spend a lot of money on a public competition like MegaFace.

nttstar closed this as completed Feb 16, 2018
@ivazhu

ivazhu commented Feb 21, 2018

@MartinDione Vocord didn't spend it :)
@nttstar And what have you done about the errors in the FaceScrub subset that is used in the MegaFace challenge?

@nttstar
Collaborator

nttstar commented Feb 22, 2018

It was described in our paper.

@ivazhu

ivazhu commented Feb 22, 2018

@nttstar I have read your article and now I absolutely disagree that you are playing a fair game. First of all, you changed the test dataset: you did not simply delete something that was wrong, you replaced the "wrong" FaceScrub images with other images! Moreover, you write that additional artificial features were added to your feature vectors to highlight "bad" images in the distractor dataset. That is the same as using manual annotation! The MegaFace rules forbid both. I would also like to mention that your method of "cleaning" the dataset creates a dataset that is "clean for your algorithm". I am sure the community will find mistakes in your "error" list when you publish it.

P.S. Have you considered that the MegaFace team may have a clean correspondence list? If they recompute the results with it, your team will take last place, because images of different people will give you a very high FAR (this refers to your image replacement).

@nttstar
Collaborator

nttstar commented Feb 22, 2018

@ivazhu I'm confused by the words 'deleted' and 'changed' in your comment. Anyway, what you said is almost the same as d4nst's point, and I don't want to restate my position. The MegaFace team checked our list and solution; otherwise the result would not be on the leaderboard.

@ivazhu

ivazhu commented Feb 22, 2018

@nttstar From your article: "During testing, we change the noisy face to another right face" and "During testing, we add one additional feature dimension to distinguish these noisy faces".
From the MegaFace challenge: "1. Download MegaFace and FaceScrub datasets and development kit. 2. Run your algorithm to produce features for both datasets."

In your article you declare that you are not using the provided dataset as-is and that you augment features with manual labels. It is obvious to everyone that you broke the MegaFace rules.

@nttstar
Collaborator

nttstar commented Feb 28, 2018

@ivazhu How can you achieve 91% without removing these noises? It's beyond my imagination.

@chichan01

chichan01 commented Feb 28, 2018

@nttstar
I have read your paper, and also your noise list and the code under https://github.com/deepinsight/insightface/tree/master/src/megaface, and I am a bit confused.

In your text: "We manually clean the FaceScrub dataset and finally find 605 noisy face images. During testing, we change the noisy face to another right face, which can increase the identification accuracy by about 1%. In Figure 6(b), we give the noisy face image examples from the MegaFace distractors. All of the four face images from the MegaFace distractors are Alec Baldwin. We manually clean the MegaFace distractors and finally find 707 noisy face images. During testing, we add one additional feature dimension to distinguish these noisy faces, which can increase the identification accuracy by about 15%."

In your noise list, megaface_noises.txt has 719 noisy face images and the FaceScrub list has 605 noisy face images.
In remove_noises.py, for the FaceScrub set, the noisy image's feature is replaced by the subject's class centre plus random uniform noise. Do you really need the random noise there? Why?


Your code for removing noise in the FaceScrub set:

```python
# Replace the noisy probe feature with its class centre plus a tiny
# uniform perturbation, then L2-normalise.
center = fname2center[a]
g = np.zeros( (feature_dim+feature_ext,), dtype=np.float32)
g2 = np.random.uniform(-0.001, 0.001, (feature_dim,))
g[0:feature_dim] = g2
f = center + g
_norm = np.linalg.norm(f)
f /= _norm
feature_path_out = os.path.join(args.facescrub_feature_dir_out, a, "%s_%s.bin" % (b, out_algo))
write_bin(feature_path_out, f)
```

However, for the MegaFace set, I don't understand what you are doing. On a first reading it seemed that you fill the feature with 100 for those noisy images, but after reading your load_bin function that does not seem to be the case, since you overwrite the 100-filled entries with the original extracted feature of the noisy image.

Your code for the noisy images in MegaFace:

```python
feature = load_bin(feature_path, 100.0)
write_bin(feature_path_out, feature)
```

and the load_bin function:

```python
def load_bin(path, fill = 0.0):
    with open(path, 'rb') as f:
        bb = f.read(4*4)
        #print(len(bb))
        v = struct.unpack('4i', bb)
        #print(v[0])
        bb = f.read(v[0]*4)
        v = struct.unpack("%df"%(v[0]), bb)
        feature = np.full( (feature_dim+feature_ext,), fill, dtype=np.float32)
        feature[0:feature_dim] = v
        #feature = np.array( v, dtype=np.float32)
        #print(feature.shape)
        #print(np.linalg.norm(feature))
    return feature
```

  1. @ivazhu @nttstar Is there something I have misunderstood? Please advise. (It seems that your code does not match what you describe in the paper!)

  2. Did you use this code and these lists to reproduce your MegaFace result, or is the discrepancy a typo? If you have updated your code or lists, would you tell us your updated result on MegaFace? Please verify it with your pretrained model.

@ivazhu

ivazhu commented Feb 28, 2018

@nttstar First of all, WE DIDN'T CHANGE THE DATASET as you did. There are some secrets :) For instance, think about what to do when you see more than one face in an "error" distractor image.

And also, as I promised, take a look at Alley_Mills_52029, Lindsay_Hartley_33188, Michael_Landes_43643, ... These are not ERRORS in the dataset; they are errors of your algorithm. In your "work" you simply deleted all the samples on which your algorithm was not working correctly.

Any more questions?

@chichan01

chichan01 commented Feb 28, 2018

@ivazhu,
What do you mean by "more than one face in an 'error' distractor image"? In verification it is pair matching; also, you do not know whether an image comes from the distractor set or the gallery (FaceScrub).

@ivazhu

ivazhu commented Feb 28, 2018 via email

@chichan01

Do you mean that there can be more than one face in an image, whether it is in the FaceScrub subset or the distractor set?
So you do not use their provided JSON as a landmark reference in your case?

@ivazhu

ivazhu commented Feb 28, 2018 via email

@chichan01

chichan01 commented Feb 28, 2018

I see. I think this is worth discussing. Originally, I thought the JSON files were there to tell participants where the face is in an image, so that the algorithm computes similarity scores between those faces (not whole images). Perhaps some face locations in the JSON files are incorrect, but that is the ground truth of this challenge, and we should therefore base our scores on it, errors included.

Anyway, your trick may also not be sound, because you only apply it to the MegaFace dataset, which means any image with more than one face can be flagged as a distractor. In other words, you already know the side information that one image in your pair is a distractor, so you can do whatever gives a good score for a mismatched pair. I think this also violates the verification protocol, since the algorithm should score a test pair without any side information.
Finally, is this what you did in your submitted MegaFace result?

@nttstar
Collaborator

nttstar commented Feb 28, 2018

@ivazhu Choosing one face out of multiple faces in an image according to your own knowledge is also a data trick. It is the same as cleaning data noise.

@nttstar
Collaborator

nttstar commented Feb 28, 2018

@chichan01 Adding random noise to the centre vectors avoids identical feature vectors. The result does not change if no noise is applied.
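
For reference, a minimal NumPy sketch of what the quoted remove_noises.py snippet does for a noisy FaceScrub image (illustrative dimensions, not the repo's exact settings):

```python
import numpy as np

feature_dim, feature_ext = 512, 1   # illustrative sizes only
center = np.random.randn(feature_dim + feature_ext).astype(np.float32)  # stand-in class centre

def replace_noisy_probe(center):
    # Class centre plus a tiny uniform perturbation, then L2-normalised.
    # The +/-0.001 noise only keeps several noisy images of the same identity
    # from ending up with byte-identical features; it is far too small to
    # move the vector away from the centre direction.
    g = np.zeros_like(center)
    g[:feature_dim] = np.random.uniform(-0.001, 0.001, feature_dim)
    f = center + g
    return f / np.linalg.norm(f)

f1, f2 = replace_noisy_probe(center), replace_noisy_probe(center)
print(float(np.dot(f1, f2)))  # ~1.0: distinct vectors, essentially the same direction
```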

@chichan01

@nttstar
So would you like to tell me what you intend to do with the distractor set?

@ivazhu

ivazhu commented Feb 28, 2018 via email

@d4nst
Author

d4nst commented Feb 28, 2018

I think this whole discussion proves my point. Both of your teams (Deepinsight and Vocord) have not strictly followed the MegaFace protocol, so it is pointless to compare the performance of your algorithms with that of the rest of the participants.

@ivazhu

ivazhu commented Feb 28, 2018 via email

@d4nst
Author

d4nst commented Feb 28, 2018

@nttstar and @chichan01 have already explained this, but I'll try to make it clearer...

As I understand it, this is what you do: in the FaceScrub set, you always crop the correct face (using the provided landmarks or bounding box as a reference). In the MegaFace distractor set, you crop all the detected faces, compare them against the probe face, and select the lowest score as the "valid" score.

The problem with your approach is that you are using your knowledge about the origin of the image (probe set or distractor set) to make a decision. You know that a probe and a distractor image shouldn't match, so you just take the lowest score. As others have pointed out, you could use a poor face detector that doesn't even detect faces and this approach would give you a very high accuracy.
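
To make that concrete, here is a minimal sketch of the scoring scheme I believe you are describing (score() and detect() are placeholder helpers, not your actual code):

```python
# Hypothetical illustration of the asymmetry described above. score(a, b) is any
# face-to-face similarity and detect(img) any face detector; both are placeholders.

def distractor_score(probe_crop, distractor_img, score, detect):
    # Distractor side: compare the probe against every detected face and keep
    # the LOWEST similarity, i.e. assume the pair should not match.
    return min(score(probe_crop, face) for face in detect(distractor_img))

def gallery_score(probe_crop, gallery_img, score, crop_from_provided_bbox):
    # Probe/gallery side: use only the face given by the provided landmarks/bbox,
    # so the genuine match keeps its (high) similarity.
    return score(probe_crop, crop_from_provided_bbox(gallery_img))

# Knowing which side of the pair is the distractor is exactly the side
# information a deployed identification system would not have.
```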

If you really wanted to take the lowest score, you should do it all the time, not just for distractor images, i.e. when you add the matching face from the probe set to the distractor set, you should also compare against all the detected faces and take the lowest score. If you do that, your performance will probably be much worse.

Lastly, just think about a real identification system in which you don't know anything about the origin of the faces. I'm sure that you would agree with me that always selecting the lowest score from all the detected faces would be a very poor design.

Please let me know if my assumptions about your approach are wrong.

@ivazhu

ivazhu commented Feb 28, 2018 via email

@chichan01

chichan01 commented Feb 28, 2018

@nttstar I have just found that you also released results on the FGNet MegaFace challenge. I have some questions about that result.

  1. Do you use only the same training set (i.e. your provided MS-1M-celeb) to train the deep network, and test on both the FaceScrub and FGNet MegaFace challenges?
  2. If yes, it means that I can verify your pretrained model on both challenges. Would you mind telling me which of your released pretrained models to use for testing on FGNet?
  3. Did you also clean the test dataset? If yes, would you mind releasing the list?
  4. Will you update your paper to include the FGNet results?

@chichan01

chichan01 commented Feb 28, 2018

@ivazhu
Would you mind explaining what you see as the differences between the MegaFace challenge and real-world identification and verification here?

My point, which @d4nst and @nttstar also make, is that you treat only the distractor set specially, and that is not the case in a real scenario, where we have no side information about the image pair.
I agree that @nttstar's result is not comparable with other work, because other participants did not do the same; but they released the list, some code, and pretrained models, so we can regard them as proposing a new protocol and can verify their work. Former and future participants can therefore do a bit of extra work to follow this new protocol and produce comparable results if they want to.

On the other hand, your work will be much harder to follow and reproduce, since you do not release anything. Luckily, you showed up here, so I now understand a bit of your approach. Most importantly, your proposed trick violates the fundamental principle of biometric verification and identification.

@happynear

happynear commented Mar 1, 2018

It is common knowledge in the face recognition community that absolute performance numbers on MegaFace are meaningless. Lots of cheating tricks can be applied to achieve very high scores, and MegaFace has no mechanism to prevent them.

Only the relative scores can be trusted, which means one can only compare against oneself on MegaFace. As in this issue: models evaluated on the cleaned list should only be compared with models evaluated on the same list. The authors have these experiments in their paper. That is already enough, not to mention that they released their code and will release the cleaned list.

The only problem is that the official MegaFace organizers should have created two leaderboards once they became aware of the "cleaned list". However, they were not willing to do so and chose to put everything on the same leaderboard. That is the problem. The authors of InsightFace did nothing wrong.

@HaoLiuHust

It would be appreciated if Vocord let us know what they have done to the dataset.

@d4nst
Author

d4nst commented Mar 1, 2018

@ivazhu the point of my comment was not about the failed detections. It was about treating probe and distractor images differently, when in a real setting you would not have this information.

@nttstar
Collaborator

nttstar commented Mar 1, 2018

@chichan01

  1. Yes, the refined MS1M.
  2. Yes, using the resnet100 model; you should get similar performance.
  3. Some misaligned faces were rectified and some potentially wrong labels were checked. However, on FGNET, label noise has a smaller effect on performance than on FaceScrub. Due to the age span, selecting noisy labels is hard, and J Deng is confirming some of these noisy labels. You can email him to enquire.
  4. Maybe. We may add the FGNET results to the paper.

@nttstar
Collaborator

nttstar commented Mar 1, 2018

@d4nst
As you can see from this thread, I believe many submitted results used a cleaned list, implicitly or explicitly, especially those above 85% accuracy. Some teams may not even realize it.

@d4nst
Author

d4nst commented Mar 1, 2018

@nttstar Yes, I have realised that. However, that doesn't make things any better. The MegaFace leaderboard seems pretty meaningless to me now. We need a reliable, standard benchmark in the face recognition community, similar to ImageNet.

@ivazhu

ivazhu commented Mar 1, 2018 via email

@nttstar
Collaborator

nttstar commented Mar 1, 2018

Another big issue I want to raise here: cleaning the test set (the MegaFace distractor set) is a must-do when testing your algorithms/models; otherwise the results will not be solid even if you only compare relative scores. Take an example from our paper: SphereFace (m=4, lambda=5) achieves 82.95% and 97.43% on the 'before-cleaning' and 'after-cleaning' protocols respectively, while ArcFace (m=0.4) gets 82.29% and 98.10% (ArcFace (m=0.5) gets 98.36% for reference). Looking at the scores before refinement, one might judge that SphereFace is better than ArcFace under the m=0.4 setting, but it actually is not.
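
To summarise those numbers:

```
Model                     before cleaning   after cleaning
SphereFace (m=4, lambda=5)     82.95%           97.43%
ArcFace (m=0.4)                82.29%           98.10%
ArcFace (m=0.5)                  -              98.36%
```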

@ivazhu

ivazhu commented Mar 1, 2018 via email

@nttstar
Collaborator

nttstar commented Mar 1, 2018

The FaceScrub errors are all fixed in my experiments, so they do not affect the comparison. Also, we see only about a 1% accuracy improvement after fixing the FaceScrub identity errors, much less than from the MegaFace fixes. It is not worth dwelling on here.

@ivazhu

ivazhu commented Mar 1, 2018 via email

@nttstar
Collaborator

nttstar commented Mar 1, 2018

You can open a standalone issue to describe it. The noise list must be refined over and over, and not only by our team.
These three items will not lead to much difference (<0.1% I believe, since the ~600 FaceScrub noises only increase the accuracy by about 1%). I don't think it breaks my argument above.

@ivazhu

ivazhu commented Mar 1, 2018 via email

@nttstar
Collaborator

nttstar commented Mar 1, 2018

@ivazhu I'm wasting my time replying to you. From +0.66% to -0.67%, so you mean your three items can shift the accuracy by more than 1% in one direction? Please stop.

@ivazhu

ivazhu commented Mar 1, 2018 via email

@nttstar
Collaborator

nttstar commented Mar 1, 2018

I will give you the results soon.
EDIT:
SphereFace got 97.60% while ArcFace(m=0.4) got 98.28% after adding these three.
The conclusion is still the same. Thanks for sharing this.
To restate what I said above: my point could only be wrong if SphereFace gained more than 1% from adding those three items while ArcFace stayed the same, which is hardly likely to happen.

@terencezl

I can attest to @nttstar's statement that denoising the FaceScrub probe set only changes the final score by about 1%. Out of the 3530 probes, 24 show up on the noise list from @nttstar, and 24 is a small fraction of 3530. More importantly, if you check those pictures, they are mostly still celebrities and should not match any distractors. In other words, you end up with an even smaller score increase than 24/3530 might suggest.

The reason the 707 mislabeled distractors have such a huge influence on the final score is that you are not comparing 707 against 1M distractors; you are comparing them against 3530 probes. If any mislabeled distractor achieves a higher similarity than the probe's true match, your rank-1 score for that probe is ruined. In that respect, rank-5 or rank-10 might be a better metric, because it allows for some noise. But MegaFace's devkit only prints the rank-1 score. Oh well.
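
A minimal sketch of the rank-1 computation, to show why a single mislabeled distractor ruins a probe's rank-1 hit (cosine similarity on toy vectors, purely illustrative):

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank1_hit(probe, gallery_match, distractors):
    # Rank-1 identification: the probe counts as a hit only if its true
    # gallery match is more similar than every distractor.
    match_sim = cos(probe, gallery_match)
    return all(cos(probe, d) < match_sim for d in distractors)

probe   = np.array([1.00, 0.00])
gallery = np.array([0.90, 0.10])   # true match of the probe identity
noisy   = np.array([0.99, 0.01])   # "distractor" that is really the same person

print(rank1_hit(probe, gallery, [noisy]))  # False: the mislabeled distractor outranks the true match
```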

@terencezl

> As you can see from this thread, I believe many submitted results used a cleaned list, implicitly or explicitly, especially those above 85% accuracy. Some teams may not even realize it.

It's a very good point. Having a public noise list, letting everyone examine it, and following suit is the right way to go. If someone organizes a new contest, I think the organizer should adopt the following rules:

Ideally each picture contains one face. Remove wrongly labelled ones. For pictures containing more than one face, make sure the provided face bbox corresponds to the correct face, and urge participants to use that bbox to generate landmarks, align, and generate representations.

Urge participants to use the exact same procedure for images from probe and distractor sets.

@Liuftvafas

What do you guys think of NIST's FRVT Ongoing (https://www.nist.gov/programs-projects/face-recognition-vendor-test-frvt-ongoing) as a benchmark? Many commercial vendors use it for a reasonably fair comparison of their face recognition algorithms. FRVT requires you to run your own face detection and does not provide landmarks, but I guess MTCNN does a decent job of solving that.

@ivazhu

ivazhu commented Apr 11, 2018 via email

@delveintodetail

delveintodetail commented Apr 27, 2018

With the non-cleaned MegaFace and FaceScrub, all results above roughly 85-86% involve cheating in some way (willful or unintentional).

With the cleaned MegaFace, evaluations are unfair to methods evaluated on the previous, non-cleaned protocol, such as SphereFace, TencentFace, and papers published before March.

Angular margins are very tricky: you can always mistune other people's methods and pretend your own performs better. If possible, do not introduce more parameters, because parameters always end up fitted to the training dataset: small datasets, weak constraints; larger datasets, maybe stronger constraints; who knows?

Report all your results trained on CASIA, MS-Celeb-1M, and VGGFace2, because most previous methods are trained on CASIA.

There is no fairness in the face recognition area. Probably a blind test and protocol would be OK.

The MegaFace protocol sucks... Let's try something else...
