For the last few months I've been participating in the Kaggle Diabetic Retinopathy challenge (for a quick introduction, see Wikipedia). We had to find an algorithm that grades the severity of the disease from images of the eyes of diabetic patients.
As the competition progressed I became more and more convinced that automatic screening would really be very helpful. The scoring metric is the quadratic weighted kappa. Interestingly, a few teams scored 0.85 or higher, and according to the literature on kappa, 0.85 means the algorithm agrees with the graders very well (look here and here).
Now we come to the main issue: the algorithms had to match the labels provided by doctors. However, doctors make mistakes, and it turned out that the algorithm sometimes gets "downscored" while actually making the right prediction.
That's why I decided to put the confusion matrix online. The idea is to let people comment on the predictions made by the algorithm and the labels given by the doctors, so that all in all we get a better feel for the practical use of automatic screening. Perhaps GitHub is not ideal for this; if you know a better platform, please feel free to fork it and build a better experience!
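For readers who want to see what the metric actually does, here is a minimal sketch of quadratic weighted kappa in Python (my own code using numpy, not the official Kaggle scorer; the function name is mine):

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, n_grades=5):
    """Agreement between two lists of integer grades: 1.0 = perfect, 0.0 = chance level."""
    rater_a = np.asarray(rater_a)
    rater_b = np.asarray(rater_b)

    # Observed confusion matrix O[i, j]: how often rater_a says i while rater_b says j.
    observed = np.zeros((n_grades, n_grades))
    for a, b in zip(rater_a, rater_b):
        observed[a, b] += 1

    # Expected matrix under independence: outer product of the two marginal histograms.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    # Quadratic penalty: disagreeing by two grades costs four times as much as by one.
    grades = np.arange(n_grades)
    weights = (grades[:, None] - grades[None, :]) ** 2 / (n_grades - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()
```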
Rows: labels given by doctors
Columns: labels predicted by the algorithm
| | Pred 0 | Pred 1 | Pred 2 | Pred 3 | Pred 4 |
|---|---|---|---|---|---|
| Doctor 0 | 24502 | 1116 | 186 | 5 | 1 |
| Doctor 1 | 1175 | 829 | 437 | 2 | 0 |
| Doctor 2 | 490 | 791 | 3212 | 771 | 28 |
| Doctor 3 | 8 | 18 | 140 | 565 | 142 |
| Doctor 4 | 3 | 3 | 50 | 146 | 506 |
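To make the table easier to interpret, here is a small snippet (again my own sketch, not part of the competition code) that recomputes the quadratic weighted kappa from this confusion matrix and prints the share of doctor-level-2 eyes that the algorithm scored as 0, the cell discussed below:

```python
import numpy as np

# The confusion matrix from the table above: rows = doctor grade, columns = predicted grade.
observed = np.array([
    [24502, 1116,  186,    5,    1],   # doctor grade 0
    [ 1175,  829,  437,    2,    0],   # doctor grade 1
    [  490,  791, 3212,  771,   28],   # doctor grade 2
    [    8,   18,  140,  565,  142],   # doctor grade 3
    [    3,    3,   50,  146,  506],   # doctor grade 4
], dtype=float)

n = observed.shape[0]
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
weights = (np.arange(n)[:, None] - np.arange(n)[None, :]) ** 2 / (n - 1) ** 2
kappa = 1.0 - (weights * observed).sum() / (weights * expected).sum()
print("quadratic weighted kappa: %.3f" % kappa)

# Share of doctor-level-2 eyes that the algorithm scored as level 0.
print("doctor 2, predicted 0: %.1f%%" % (100 * observed[2, 0] / observed[2].sum()))
```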
Perhaps the biggest problem is where the doctor scores a 2 and the computer a 0. This is probably because a single small symptom can immediately push an image to level 2. Examples are [big hemorrhages](https://www.google.nl/search?q=hemorrhages&espv=2&biw=1920&bih=1055&source=lnms&tbm=isch&sa=X&ved=0CAYQ_AUoAWoVChMImp_m27L7xgIVBpQsCh28NgzO#tbm=isch&q=hemorhages+retinopathy) and cotton wool spots.
### Some examples that I encountered (there are many more of them!)