[Question] Negative patients impact on test FROC score... #272
Comments
Dear Thibault, regarding your side question: both values first pool TP and FP counts across the entire dataset (nothing is computed per sample). The TP count is then "normalized" by the number of ground-truth objects (i.e., classical sensitivity at the object level), and the FP count is normalized by the number of images. The behaviour when adding negative patients will change depending on the type of problem you are looking at and on what kind of negative patient images are added. Some thoughts:
I highly recommend the Metrics Reloaded and metrics pitfalls papers to build intuition for these things. In the end, the best way to look into this problem more closely is to look at your images and the predictions :)
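To make that aggregation concrete, here is a minimal numpy sketch of the pooled computation described above. The function and argument names are mine, not MONAI's (`monai.metrics.compute_froc_curve_data` implements the same idea); the point is that both axes are built from dataset-level counts, and negative patients enter only through the image count in the x-axis denominator plus whatever FPs they contribute:

```python
import numpy as np

def froc_curve(scores, is_tp, num_targets, num_images):
    """FROC curve from all candidate detections pooled over the dataset.

    scores      -- confidence of every candidate across the whole test set
    is_tp       -- 1 if the candidate hits a ground-truth lesion, else 0
    num_targets -- total number of ground-truth lesions in the test set
    num_images  -- total number of scans, including negative (healthy) ones
    """
    order = np.argsort(scores)[::-1]                    # sweep threshold downward
    hits = np.asarray(is_tp, dtype=float)[order]
    sensitivity = np.cumsum(hits) / num_targets         # y-axis: TP / #lesions
    fps_per_image = np.cumsum(1.0 - hits) / num_images  # x-axis: FP / #images
    return fps_per_image, sensitivity
```

Note the asymmetry: adding a healthy scan increases `num_images` and usually the FP count, but can never increase the TP count.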
Dear Michael, thank you for your complete answer.
OK, thank you for the clarification; it matches what I thought.
Indeed, you are totally right: my intuition was strong, but I couldn't come up with any good arguments to defend it, so I ran a little experiment with subgroups. I built subgroups by varying the proportion of healthy patients and, more generally (less binary or discontinuous), by varying the average number of lesions per scan in each subgroup, always testing the same trained model on these different groups. The results are clear:

So my intuition was not only incorrect, but the opposite of reality... How can I explain this fallacious reasoning? I think it's because, of all the differences between LIDC and LUNA (a part of it), the one that seemed the most obvious/trivial, the least "subtle", to me was the absence of healthy patients in LUNA. Since I observed a difference in scores, I attributed it to that. The fact that apourchot had the same intuition here reinforced mine; what's more, I had obtained better results than on LIDC in cross-validation on a proprietary external test dataset with no healthy patients... Well, intuition can definitely be a trap!

General thoughts: FROC can be highly volatile with respect to non-obvious confounders, while still being the primary measure of detection performance. Do you agree? This makes me think that in detection it's even more crucial to compare models and methods on the same datasets, in a paired fashion, than in segmentation or classification, where comparisons can still be tricky but the pitfalls are much easier to spot intuitively. Do you agree?

Anyway, thank you very much for all your answers every time! Really appreciate it!
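A toy simulation in the same spirit as this subgroup experiment reproduces the trend. Everything below is assumed for illustration (score distributions, candidate counts, and the premise that every scan emits a similar number of candidates, so a healthy scan spends its whole candidate budget on false positives); the positive scans are kept fixed and only healthy scans are added:

```python
import numpy as np

rng = np.random.default_rng(0)

def froc_score(scores, is_tp, num_targets, num_images,
               fp_budgets=(0.5, 1.0, 2.0, 4.0)):
    """Mean lesion-level sensitivity at fixed FP/scan budgets (pooled FROC)."""
    order = np.argsort(scores)[::-1]
    hits = np.asarray(is_tp, dtype=float)[order]
    sens = np.cumsum(hits) / num_targets          # y-axis
    fps = np.cumsum(1.0 - hits) / num_images      # x-axis
    return np.mean([sens[fps <= b].max(initial=0.0) for b in fp_budgets])

# Fixed positive scans: 2 lesions + 5 FP candidates each, overlapping scores.
n_pos, lesions_per_scan, fp_per_pos = 500, 2, 5
n_lesions = n_pos * lesions_per_scan
tp_scores = rng.normal(0.6, 0.15, n_lesions)
fp_scores_pos = rng.normal(0.3, 0.15, n_pos * fp_per_pos)

for frac_neg in (0.0, 0.25, 0.5, 0.75):
    n_neg = int(n_pos * frac_neg / (1 - frac_neg))
    # A healthy scan has no lesions, so all 7 of its candidates are FPs.
    fp_scores_neg = rng.normal(0.3, 0.15, n_neg * (lesions_per_scan + fp_per_pos))
    scores = np.concatenate([tp_scores, fp_scores_pos, fp_scores_neg])
    is_tp = np.concatenate([np.ones(n_lesions),
                            np.zeros(fp_scores_pos.size + fp_scores_neg.size)])
    s = froc_score(scores, is_tp, n_lesions, n_pos + n_neg)
    print(f"healthy fraction {frac_neg:.2f}: FROC score = {s:.3f}")
```

Under these assumptions, sensitivity at any given threshold is untouched, but the FP/scan axis inflates with every added healthy scan, so the printed FROC score drifts downward as the healthy fraction grows.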
❓ Question
Maybe related to Project-MONAI/tutorials#1582
Hello,
I observe that when adding negative patients (with no lesions to detect) to the test or cross-validation set, the FROC score always decreases. When I exclude these patients while evaluating the same trained model, the score is higher...
I tried to explain this to myself by the fact that adding negative patients can only add false positives (FP) and nothing else (there are no true negatives in detection), with no way to increase the sensitivity (Se), biasing it toward low values at a fixed FP/scan. But a colleague challenged this explanation, saying the following:
For example, if we have a score of Se 80% @ 2 FP/scan on average and we add negative, healthy patients, Se will remain the same, and we could still expect 2 FP/scan on average...
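For concreteness, here is that disagreement at a single operating point, with purely illustrative numbers; the key unknown is the FP rate the model produces on healthy scans (`fp_per_neg` below is an assumption):

```python
# One FROC operating point before/after adding healthy scans (hypothetical numbers).
n_pos, n_lesions, tp, fp_pos = 100, 200, 160, 200   # Se = 80% @ 2 FP/scan
n_neg, fp_per_neg = 100, 4                          # assumed FPs per healthy scan

se = tp / n_lesions                                 # unchanged: healthy scans add no lesions
fp_before = fp_pos / n_pos                          # 2.0 FP/scan
fp_after = (fp_pos + n_neg * fp_per_neg) / (n_pos + n_neg)  # 3.0 FP/scan
print(f"Se stays at {se:.0%}; the same threshold moves from {fp_before} to {fp_after} FP/scan")
```

If `fp_per_neg` matched the 2 FP/scan already seen on positive scans, the operating point would not move (the colleague's scenario); if healthy scans yield more, as assumed here, the curve shifts right, and reading Se at a fixed 2 FP/scan forces a higher threshold, hence a lower sensitivity.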
Do you have elements of an answer, or an explanation of this phenomenon? Have you observed it as well?
Another side question: the X-axis (FP/scan), is it computed at the sample level and then averaged? And the Y-axis: at the lesion level, aggregated across samples, or also averaged? Maybe this could help with understanding?
Thank you very much.
Best,
Thibault