
[ENH] Replace scikit-learn tSNE with faster implementation #3192

Merged · 15 commits merged into biolab:master on Nov 8, 2018

Conversation

@pavlin-policar (Collaborator) commented Aug 8, 2018

Issue

Scikit-learn's implementation of tSNE is slow, and it supports only the Barnes-Hut approximation. Moreover, adding new data points to an existing embedding is not supported by any existing tSNE implementation.

Description of changes

Implement wrappers around my implementation of tSNE, which includes both the Barnes-Hut approximation for smaller data sets and the recently introduced interpolation-based tSNE, which runs in linear time, for larger data sets.

We can also now add new data points to an existing embedding by running optimization on those points only, w.r.t. the existing embedding.
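
To illustrate the workflow (a minimal sketch, not code from this PR; it assumes the fit/transform API of my library under its current package name, openTSNE):

    import numpy as np
    from openTSNE import TSNE

    X_train = np.random.randn(500, 10)   # existing data
    X_new = np.random.randn(20, 10)      # new samples to place

    # Optimize an embedding on the initial data.
    embedding = TSNE().fit(X_train)

    # Optimize positions of only the new points w.r.t. the fixed embedding.
    new_points = embedding.transform(X_new)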

Also, I've never packaged anything for PyPI before, and I've run into a fair number of issues with it, but hopefully it's all sorted out now.

Includes
  • Code changes
  • Tests
  • Documentation

@pavlin-policar changed the title from "Replace scikit learn tSNE with faster implementation" to "[WIP] Replace scikit learn tSNE with faster implementation" on Aug 13, 2018
@pavlin-policar (Collaborator Author)

b646f6f fixes the Manifold Learning widget to work with the new tSNE wrappers. I've had to remove the "jaccard", "mahalanobis" and "cosine" distances, since sklearn's fast neighbor-search methods don't support them. The previous implementation did support them, but for any reasonably sized data set all pairwise distances had to be computed, which was very slow.
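
For reference, sklearn's tree structures expose the metrics they support, so the supported set can be checked directly (a quick illustration, not part of this PR):

    from sklearn.neighbors import BallTree, KDTree

    # KD-trees support only a handful of Minkowski-type metrics; ball trees
    # support more, but "cosine" appears in neither list because it is not
    # a proper metric.
    print(sorted(KDTree.valid_metrics))
    print(sorted(BallTree.valid_metrics))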

@codecov-io commented Sep 12, 2018

Codecov Report

Merging #3192 into master will increase coverage by 0.04%.
The diff coverage is 91.57%.

@@            Coverage Diff             @@
##           master    #3192      +/-   ##
==========================================
+ Coverage   82.21%   82.25%   +0.04%     
==========================================
  Files         351      351              
  Lines       62301    62442     +141     
==========================================
+ Hits        51219    51363     +144     
+ Misses      11082    11079       -3

@@ -99,23 +99,22 @@ def test_singular_matrices(self):
re-introduced, this test is very much required.

"""
# table = Table(
Member

You can @Skip this test with the same description and uncomment the code.
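
For instance (a hedged sketch of that suggestion; the class name and skip reason are assumptions, while the test name comes from the diff above):

    import unittest

    class TestManifold(unittest.TestCase):
        @unittest.skip("Guards against singular matrices; re-enable when "
                       "the relevant code path is re-introduced.")
        def test_singular_matrices(self):
            # the body can stay uncommented here; it simply never runs
            self.fail("unreachable while skipped")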

Collaborator Author

Oh yes, that is so much nicer.

@lanzagar added this to the 3.16 milestone on Sep 12, 2018
@lanzagar modified the milestones: 3.16, 3.17 on Sep 12, 2018
@pavlin-policar (Collaborator Author)

Travis fails for Python 3.4, which we don't support anymore, so this is not an issue. The failure occurs because pynndescent, the library I use for fast approximate nearest-neighbor search, depends on numba, which does not support anything below Python 3.5. If depending on numba becomes problematic, this can be switched to a different ANN library later on.

Another issue with numba is that it does not support the parallel=True directive on 32-bit systems, which we currently support. I've worked around this in the tSNE library by monkey patching (here) the @numba.njit decorator, which pynndescent uses extensively. My fix removes all numba acceleration from pynndescent, which really shouldn't be an issue: we can still fall back on exact search using sklearn's trees, and even without numba, pynndescent will still be asymptotically faster than exact search.
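
(A sketch of that workaround with assumed names; the actual patch is in the linked code:)

    import numba

    _original_njit = numba.njit  # keep a reference to the real decorator

    def _noop_njit(*args, **kwargs):
        # numba.njit is used both bare (@njit) and with options (@njit(...)).
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]      # bare usage: return the function unchanged
        return lambda f: f      # parametrized usage: no-op decorator

    numba.njit = _noop_njit
    import pynndescent  # must be imported only after the patch is applied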

In the event we add anything else relying on numba (e.g. UMAP, which is sometimes nicer and usually faster than even this implementation, though asymptotically both are linear in the number of points), both of these issues are certainly something to be aware of.

@pavlin-policar changed the title from "[WIP] Replace scikit learn tSNE with faster implementation" to "Replace scikit learn tSNE with faster implementation" on Sep 12, 2018
@kernc (Contributor) commented Sep 12, 2018

I've worked around this in the tSNE library by monkey patching (here) the @numba.njit decorator

Only for Python 2.7?

@pavlin-policar (Collaborator Author) commented Sep 12, 2018

uns1 = sys.platform.startswith('win32') and sys.version_info[:2] == (2, 7)

checks the platform and Python version, and

uns2 = sys.maxsize <= 2 ** 32

checks whether the system is 32-bit. These are the two cases where numba doesn't support the parallel directive. I took these checks directly from numba, so I really only patch when necessary; I even kept the variable names to make that (somewhat) obvious.
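
Put together (a sketch; the final variable name is my own):

    import sys

    # Both conditions as quoted from numba's platform-support checks.
    uns1 = sys.platform.startswith('win32') and sys.version_info[:2] == (2, 7)
    uns2 = sys.maxsize <= 2 ** 32
    parallel_supported = not (uns1 or uns2)  # patch @numba.njit only when False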

@kernc (Contributor) commented Sep 12, 2018

For some reason, I read uns1 and uns2. Nevermind. 😳

@kernc (Contributor) commented Sep 12, 2018

If that's any better, you could patch out only the parallel=True argument:

    import numba

    __njit_copy = numba.njit  # keep the original decorator around

    def __njit_wrapper(*args, parallel=True, **kwargs):
        """Discards the unsupported `parallel` argument"""
        return __njit_copy(*args, **kwargs)

    numba.njit = __njit_wrapper

@pavlin-policar (Collaborator Author) commented Sep 12, 2018

Yeah, I did try to be clever like that, but then other problems popped up. Apparently, numba's compiler expects int64s and is surprised when that's not the default integer width, as on 32-bit systems. I didn't check whether this was baked into numba itself or just into pynndescent, but either way it would have been really hard to patch up.

It ended up being a lot simpler to just remove numba acceleration altogether. Who uses 32-bit systems anyway? 😄

@pavlin-policar (Collaborator Author)

Everything is now in place; the only thing I'm still waiting on is for the conda package to be properly set up. After that, this should be good to merge. Unfortunately, conda-forge seems a bit broken for the time being.

conda-forge/staged-recipes#6659

@pavlin-policar (Collaborator Author)

I've now managed to get the package onto conda-forge as well and have gotten the hang of all this. Currently, the requirement is listed in both requirements-core.txt and conda-recipe/meta.yaml. Is this necessary? Should I only put it into the conda recipe?

@markotoplak (Member)

Should I only put it into the conda recipe?

Anaconda Python is just one Python distribution. Orange should also work with just setup.py (or with pip, without conda), so the requirement needs to stay in requirements-core.txt as well.
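
For illustration only (hypothetical entries with an assumed package name; the real names and versions are in the PR diff), the dependency ends up declared in both places so that pip installs and conda builds each pick it up:

    # requirements-core.txt -- consumed by setup.py / pip
    fasttsne>=0.2

    # conda-recipe/meta.yaml -- consumed by conda-build
    requirements:
      run:
        - fasttsne >=0.2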

@pavlin-policar (Collaborator Author)

I haven't really been able to figure out why the test was failing on Windows, so I've decided to drop sparse support for t-SNE.

IMO, there is actually a valid argument for this. Sparse data usually indicates high-dimensional data, and it is well known that t-SNE doesn't scale well with the ambient dimension: high-dimensional input typically leads not only to poor visualizations but also to much longer runtimes. It is standard (and good) practice to reduce the dimensionality of the data via feature selection and/or PCA before embedding it with t-SNE, and running any kind of data through PCA produces a dense output, so dropping sparse support here is not the worst thing in the world.

This way, we make it impossible for users to run t-SNE in a way we know to be bad, and we encourage usage that will produce better visualizations.
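
(A sketch of that recommended preprocessing; sklearn's TruncatedSVD stands in for PCA here since, unlike PCA, it accepts scipy sparse input directly. The 50-component choice is an assumption:)

    from scipy import sparse
    from sklearn.decomposition import TruncatedSVD

    X_sparse = sparse.random(1000, 10000, density=0.01, format='csr')

    # Reduce the sparse, high-dimensional input to ~50 dense components;
    # the result is a dense ndarray that t-SNE can embed efficiently.
    X_reduced = TruncatedSVD(n_components=50).fit_transform(X_sparse)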

@lanzagar changed the title from "Replace scikit learn tSNE with faster implementation" to "[ENH] Replace scikit-learn tSNE with faster implementation" on Nov 7, 2018
@lanzagar merged commit ee460d5 into biolab:master on Nov 8, 2018