WIP: Add MNIST classification example notebook #442
Conversation
@ulupo Please don't make modifications before it is ready. Merging and resolving conflicts on Jupyter notebooks is hard enough. I will let you know once it's ready for your review! Thanks :)
@gtauzin Noted. I was just reacting to a request for review, apologies!
Oops, sorry, I was not aware I had made a request. No worries!
@ulupo: I am now getting the dataset from OpenML and have adapted the notebook to the plotting API. I have also added some large-scale feature generation and a grid search :) Can you tell me what you think about the content I suggest?
Just a few minor comments for now. I will look at the more important things tomorrow!
I'm generally happy with the content and I'll be happy to help refine the presentation too. The notebook does a great job of showing how to construct a highly nontrivial pipeline with a great number of different features created using TDA.
My main comment on the content is the following: I wonder if we could be a bit more sophisticated towards the end by showing how to use scikit-learn tools for feature importance/feature selection, as can be found e.g. here or here. Currently, a form of feature selection is illustrated at the end, but it seems to amount to testing a subset of the 672 univariate models (an RF on a single persistent-entropy feature) to see which univariate model is best; one could then rank them by validation performance and include only the top N in a final multivariate model, which one would then retrain. But features which are highly correlated might perform similarly well, so this feature selection would not necessarily optimize for a "globally good" list of complementary features. More generally, the user might wonder how to perform feature selection on the multivariate problem directly.
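For illustration, a minimal sketch of the kind of joint (multivariate) selection I have in mind, using scikit-learn's `SelectFromModel`. The feature matrix and labels here are random placeholders, not the notebook's actual TDA features; names like `X_tda` are my own:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X_tda = rng.normal(size=(100, 50))  # placeholder for the TDA feature matrix
y = rng.integers(0, 10, size=100)   # placeholder MNIST labels

# Fit one forest on ALL features jointly, then keep only the columns
# whose importance is at or above the median importance. This ranks
# features in the context of each other, not one univariate model at a time.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="median",
)
X_selected = selector.fit_transform(X_tda, y)
print(X_selected.shape)
```

One could equally use `RFE` or `SequentialFeatureSelector`, which more directly optimize for a complementary subset.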
examples/MNIST_classification.ipynb
Outdated
"metadata": {},
"outputs": [],
"source": [
"feature_union_filtrations.fit(X_train[:20])\n",
I imagine taking 20 samples is just for illustrative purposes.
examples/MNIST_classification.ipynb
Outdated
" for n_iterations in n_iterations_dilation_list] \\\n",
" + [SignedDistanceFiltration(n_iterations=n_iterations) \n",
" for n_iterations in n_iterations_signed_list] \\\n",
" + ['passthrough']\n",
We should remember to comment on the meaning of the passthrough option here, i.e. explain that it just captures the persistent homology of a binary image, which really is just ordinary homology.
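To make the point concrete, a small sketch (not using giotto-tda) of what "homology of a binary image" means in degree 0: the Betti number β₀ is simply the count of connected components, which `scipy.ndimage.label` computes directly. The toy image below is my own example:

```python
import numpy as np
from scipy import ndimage

# A binary image with two connected components (4-connectivity):
# one blob in the top-left, one in the bottom-right.
image = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
])
_, n_components = ndimage.label(image)
print(n_components)  # 2 — this is beta_0 of the binary image
```

The 'passthrough' branch recovers exactly this kind of information, with all births at the same filtration value.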
examples/MNIST_classification.ipynb
Outdated
"diagram_steps = [[Binarizer(threshold=0.4), \n",
" filtration, \n",
" CubicalPersistence(homology_dimensions=[0, 1]), \n",
" Scaler(metric='bottleneck')] \n",
I find it a little strange that the scaler alone improves results substantially. Normally, I'd expect a scaler to be followed by a filter; but if it isn't, can't the model weights take care of the different scales between homology dimensions?
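A quick sanity check of that intuition for tree-based models specifically: rescaling a single feature column by a positive constant should not change a random forest's predictions, since splits depend only on the ordering of values. Synthetic data and names below are my own sketch, not from the notebook:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Exaggerate the scale of one column, as if one homology dimension's
# features lived on a very different scale.
X_scaled = X.copy()
X_scaled[:, 1] *= 1000.0
clf_scaled = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_scaled, y)

# With the same seed, the fitted trees split at the same (rescaled)
# thresholds, so the predictions coincide.
same = bool((clf.predict(X) == clf_scaled.predict(X_scaled)).all())
print(same)
```

So if the downstream model is a forest, any benefit from `Scaler` presumably comes from its effect on the diagram-level features themselves, not from per-column scale.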
examples/MNIST_classification.ipynb
Outdated
"]\n",
"\n",
"#\n",
"feature_union = make_union(*[PersistenceEntropy()] + [Amplitude(**metric, order=None) \n",
Isn't there a missing pair of brackets here? I.e. should this not be:
feature_union = make_union(*[[PersistenceEntropy()] + [Amplitude(**metric, order=None)
for metric in metric_list]])
?
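For what it's worth, a minimal check of how `*`-unpacking interacts with list concatenation inside a call: the expression after `*` is evaluated first, so the original one-bracket form already unpacks the full concatenated list. The toy function below is my own illustration:

```python
def f(*args):
    return args

# *expr in a call evaluates expr first, so this unpacks [1, 2, 3].
result = f(*[1] + [x for x in (2, 3)])
print(result)  # (1, 2, 3)
```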
Signed-off-by: Guillaume Tauzin guillaumetauzin.ut@gmail.com
Types of changes
Description
Add the full-blown MNIST ML example
Checklist
- I have run flake8 to check my Python changes.
- I have run pytest to check this on Python tests.