Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Add MNIST classification example notebook #442

Closed
wants to merge 0 commits into from

Conversation

gtauzin
Copy link
Collaborator

@gtauzin gtauzin commented Aug 1, 2020

Signed-off-by: Guillaume Tauzin guillaumetauzin.ut@gmail.com

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Description
Add the MNIST full-blown ML example

Checklist

  • I have read the guidelines for contributing.
  • My code follows the code style of this project. I used flake8 to check my Python changes.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed. I used pytest to check this on Python tests.

@gtauzin gtauzin requested a review from ulupo August 1, 2020 13:44
@gtauzin gtauzin self-assigned this Aug 1, 2020
@gtauzin gtauzin changed the title Add first version of MNIST notebook WIP: Add first version of MNIST notebook Aug 1, 2020
examples/MNIST_classification.ipynb Outdated Show resolved Hide resolved
@gtauzin
Copy link
Collaborator Author

gtauzin commented Aug 1, 2020

@ulupo Please don't make modification before it is ready. Merging and resolving conflicts on jupyter notebooks is hard enough.

I will let you know once it's ready for your review! Thanks:)

@ulupo
Copy link
Contributor

ulupo commented Aug 1, 2020

@gtauzin noted. I was just reacting to a request for review, apologies!

@gtauzin
Copy link
Collaborator Author

gtauzin commented Aug 1, 2020

Oops, sorry, I was not aware I made a request. No worries!

@gtauzin
Copy link
Collaborator Author

gtauzin commented Aug 1, 2020

@ulupo: I am now getting the dataset from openML and I adapted the notebook to the plotting API. I have also added some large-scale feature generation and a grid search :)

Can you tell me what you think about the content I suggest?

@gtauzin gtauzin changed the title WIP: Add first version of MNIST notebook WIP: Add MNIST classification example notebook Aug 1, 2020
Copy link
Contributor

@ulupo ulupo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a first few minor comments. Will look at the more important things tomorrow!

examples/MNIST_classification.ipynb Outdated Show resolved Hide resolved
examples/MNIST_classification.ipynb Outdated Show resolved Hide resolved
examples/MNIST_classification.ipynb Outdated Show resolved Hide resolved
examples/MNIST_classification.ipynb Outdated Show resolved Hide resolved
examples/MNIST_classification.ipynb Outdated Show resolved Hide resolved
examples/MNIST_classification.ipynb Outdated Show resolved Hide resolved
examples/MNIST_classification.ipynb Outdated Show resolved Hide resolved
examples/MNIST_classification.ipynb Outdated Show resolved Hide resolved
examples/MNIST_classification.ipynb Outdated Show resolved Hide resolved
examples/MNIST_classification.ipynb Outdated Show resolved Hide resolved
Copy link
Contributor

@ulupo ulupo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally happy with the content and I'll be happy to help refine the presentation too. The notebook does a great job at showing how to construct a highly nontrivial pipeline with a great number of different features created using TDA.

My main comment on the content is the following: I wonder if we could be a bit more sophisticated towards the end by showing how to use scikit-learn tools for feature importance/feature selection as can be found e.g. here or here. Currently, a form of feature selection is illustrated at the end but it seems to amount to testing a subset of the 672 univariate models (RF on a single persistent entropy feature), to see which univariate model is the best; one could then rank them according to performance in validation, and only include the top N in a final multi-variate model which one would then train again. But features which are very correlated might perform similarly well, and our feature selection would not necessarily optimize for a "globally good" list of complementary features. More generally, the user might wonder how to perform feature selection on the multivariate problem directly.

"metadata": {},
"outputs": [],
"source": [
"feature_union_filtrations.fit(X_train[:20])\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine taking 20 samples is just for illustrative purposes.

" for n_iterations in n_iterations_dilation_list] \\\n",
" + [SignedDistanceFiltration(n_iterations=n_iterations) \n",
" for n_iterations in n_iterations_signed_list] \\\n",
" + ['passthrough']\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remember to comment on the meaning of the passthrough option here, i.e. explain that it just captures peristence homology of a binary image which really is just homology.

"diagram_steps = [[Binarizer(threshold=0.4), \n",
" filtration, \n",
" CubicalPersistence(homology_dimensions=[0, 1]), \n",
" Scaler(metric='bottleneck')] \n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it a little strange that the scaler alone improves results substantially. Normally, I'd expect a scaler to be followed by a filter, but if it's not then can't the model weight take care of the different scales between homology dimensions?

"]\n",
"\n",
"#\n",
"feature_union = make_union(*[PersistenceEntropy()] + [Amplitude(**metric, order=None) \n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there a missing pair of brackets here? I.e. should this not be:

    feature_union = make_union(*[[PersistenceEntropy()] + [Amplitude(**metric, order=None)
                                                          for metric in metric_list]])

?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants