Best options choice for classification of small and unbalanced dataset #46

mattvan83 · 2019-10-28T08:54:17Z

Hi Pradeep,

For a small, unbalanced dataset, do you recommend using -t 0.8 or -t 0.9?

Is it possible to deactivate feature selection in the implemented pipeline? If not, what is the advantage of always using feature selection when the dataset has only a few features?

Best,
Matthieu


raamana · 2019-10-28T13:46:19Z

-k all is equivalent to no feature selection.

There is no way to tell which training percentage (80% or 90%) is best. Depending on the sample size, you want enough training data (which helps performance) while keeping the test set reasonably sized. If the test set is too small, the violin plots will show large variance. So pick accordingly.
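To make the trade-off concrete, here is a rough sketch of how many test subjects per class each setting would leave. It assumes a plain stratified holdout split with the training count rounded down; neuropredict's own splitting logic may differ, and `holdout_test_counts` is a hypothetical helper, not part of the tool. The class sizes are the ones reported later in this thread.

```python
def holdout_test_counts(class_sizes, train_percent):
    """Per-class test-set sizes for a plain stratified holdout split.

    Assumption: each class is split independently and the training count
    is rounded down. neuropredict's own splitting logic may differ.
    """
    counts = {}
    for label, n in class_sizes.items():
        n_train = n * train_percent // 100  # integer arithmetic, rounds down
        counts[label] = n - n_train
    return counts


class_sizes = {"CN": 75, "AD": 15}  # class sizes reported in this thread
for train_percent in (80, 90):
    print(train_percent, holdout_test_counts(class_sizes, train_percent))
```

Under these assumptions, 90% training leaves only 2 AD subjects in each test set, so the per-repetition accuracy on that class can only be 0, 0.5, or 1. With 80% training there are 3, which is still coarse but less extreme.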

mattvan83 · 2019-10-28T13:59:35Z

Hi Pradeep,

I tried both, and the violin plots do indeed show a large variance compared to those in the neuropredict documentation (I have 75 CN and 15 AD).

Below with -t 0.9:
balanced_accuracy.pdf

and below with -t 0.8:
balanced_accuracy.pdf

Based on these violin plots, isn't the 80% training split better (less variation)?
How can I determine the best set of features? Just by comparing the medians of the 3 violin plots in the figures above? Or are there other metrics to look at?
Where can I find the mean balanced accuracy, sensitivity and specificity?
For these binary classification cases, aren't ROC curves plotted?

raamana · 2019-10-28T16:39:46Z

No clear answer there. I'd report both (one in the main text and the other in the supplementary material?).
You can run significance tests on the data saved in the CSV files; look in the exported_results folder.
They are not exported by default. I will add them to the exported results soon.
Not all predictive models have a natural ROC associated with them, hence it's not produced by default. I'll implement it soon. The current results should be enough to include in your paper?
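One possible form such a significance test could take is a paired sign-flip permutation test on the per-repetition balanced accuracies of two feature sets. This is only a sketch: the thread does not describe the layout of the CSV files, so the scores below are placeholder values, `paired_permutation_test` is a hypothetical helper, and the pairing assumes both feature sets were evaluated on the same CV repetitions.

```python
import random
from statistics import mean


def paired_permutation_test(scores_a, scores_b, n_permutations=10000, seed=0):
    """Two-sided paired sign-flip permutation test on the mean difference.

    Assumption: scores_a[i] and scores_b[i] come from the same CV repetition.
    Returns an estimated p-value.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the sign of each difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Placeholder values; in practice, load one column per feature set from
# the CSV files in the exported_results folder.
feature_set_1 = [0.62, 0.70, 0.55, 0.66, 0.71, 0.58, 0.64, 0.69]
feature_set_2 = [0.60, 0.65, 0.57, 0.61, 0.68, 0.55, 0.63, 0.62]
print(paired_permutation_test(feature_set_1, feature_set_2))
```

A rank-based alternative such as the Wilcoxon signed-rank test would be another reasonable choice. Either way, note that scores from repeated holdout splits of the same data are not independent, so the resulting p-values should be read with caution.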

mattvan83 · 2019-10-29T09:30:01Z

Shouldn't we favor the violin plot with the lower accuracy, i.e. the 80% one?
What kind of significance tests could I run, and on which .csv files?
OK, thanks. Is there any way, for the moment, to deduce sensitivity and specificity from the results currently produced?
Yes.

raamana · 2019-10-29T11:29:35Z

Use the confusion matrices and the misclassification rate plots to derive alternative performance metrics.
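As a minimal sketch of that derivation for the binary case, the snippet below computes sensitivity, specificity and balanced accuracy from the four cells of a confusion matrix. It assumes AD is treated as the positive class; the counts are made-up illustration values, not results from this thread, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tp, fn, fp, tn):
    """Derive common metrics from the cells of a 2x2 confusion matrix.

    tp/fn: positive-class subjects (assumed here to be AD) classified
           correctly/incorrectly.
    tn/fp: negative-class subjects (assumed CN) classified
           correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Made-up counts for illustration: 3 AD and 15 CN test subjects.
print(binary_metrics(tp=2, fn=1, fp=3, tn=12))
```

Averaging these values over the CV repetitions would give the mean sensitivity and specificity asked about above.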
