running on multiple cores? #18

joernroeder · 2016-04-13T21:28:16Z

Hey,
it's more a question than an actual issue: I'm mapping a dataset of 32 dims x 900000 items with tsne on a multi-core machine, but as tsne is single-threaded I'm only using one core. Do you have any tips or tricks for how I can split the dataset to parallelize the computation?
thanks in advance!

lvdmaaten · 2016-04-14T15:24:37Z

There are basically two important loops that should be straightforward to parallelize:

https://github.com/lvdmaaten/bhtsne/blob/master/sptree.cpp#L385
https://github.com/lvdmaaten/bhtsne/blob/master/tsne.cpp#L215

The first loop is actually embarrassingly parallel, so it should be completely trivial. The second loop may be somewhat trickier because each iteration may access the same nodes of the tree, so it is somewhat less predictable what speedups you can get there.
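To make the distinction between the two loops concrete, here is a minimal OpenMP sketch. It is not the bhtsne code: `per_point_forces` and `sum_with_reduction` are hypothetical stand-ins. The first shows the embarrassingly parallel pattern, where iteration `i` writes only to its own slice of the output. The second shows one common way to handle a loop that also updates a shared accumulator. The reduction only covers that accumulator; it does not address any other shared mutable state the real loop might touch in the tree.

```cpp
#include <vector>

// Hypothetical stand-in for a per-point force computation (not bhtsne code).
// Iteration i reads the shared, read-only input y and writes only to
// out[i*D .. i*D + D - 1], so no two iterations touch the same output.
void per_point_forces(const std::vector<double>& y, int N, int D,
                      std::vector<double>& out) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; ++i) {
        for (int d = 0; d < D; ++d) {
            double acc = 0.0;
            for (int j = 0; j < N; ++j) {
                if (j == i) continue;
                acc += y[i * D + d] - y[j * D + d];
            }
            out[i * D + d] = acc;
        }
    }
}

// A loop that accumulates into one shared scalar. The reduction clause
// gives each thread a private partial sum and combines them at the end.
double sum_with_reduction(const std::vector<double>& q) {
    double sum_q = 0.0;
    const int n = static_cast<int>(q.size());
    #pragma omp parallel for reduction(+:sum_q)
    for (int i = 0; i < n; ++i) {
        sum_q += q[i];
    }
    return sum_q;
}
```

With GCC or Clang this needs `-fopenmp` to actually run in parallel; without it the pragmas are ignored and the code runs serially with the same results.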

joernroeder · 2016-04-16T21:56:25Z

@lvdmaaten thanks for the info. I'll look a bit deeper into the loops and play around with them as soon as I find some spare time :)

maximsch2 · 2016-06-06T01:34:13Z

I have an OpenMP-based version here: https://github.com/maximsch2/bhtsne. I don't like the binary file interface, so I'm also modifying it to build as a shared library and expose a simple C API.
An extra pair of eyes is always useful when writing parallel code, so if anyone else is interested, go ahead and try it out!

lvdmaaten · 2016-06-08T01:24:01Z

I'm no OpenMP expert, but it looks good to me. What kind of speed-ups are you seeing compared to the non-OpenMP code?

maximsch2 · 2016-06-08T23:26:58Z

I haven't benchmarked it on big datasets yet, but I think I get around 1.3-1.5x on two cores on a smallish dataset (takes a couple of seconds to build).
My main motivation was integrating it into a visualization system where I wanted to generate t-SNE graphs on the fly. I think making it available as a library (as opposed to writing and reading files), and encoding the output dimension as a template parameter (allowing the compiler to do less allocation and also hopefully unroll all those loops along dimensions, especially when out_dims=2,3) gave me at least a 3-4x improvement in performance.
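As an illustration of the template-parameter idea, here is a minimal sketch, not taken from either repository. `squared_distance` and `squared_distance_dispatch` are hypothetical names. Because `D` is a compile-time constant, the loop bound is known to the compiler, which is then free to unroll it; a small runtime switch picks the instantiation for the common output dimensions.

```cpp
#include <stdexcept>

// Squared Euclidean distance with the dimension fixed at compile time.
// Hypothetical helper for illustration only.
template <int D>
double squared_distance(const double* a, const double* b) {
    double s = 0.0;
    for (int d = 0; d < D; ++d) {  // bound is a constant, so it can be unrolled
        const double diff = a[d] - b[d];
        s += diff * diff;
    }
    return s;
}

// Runtime entry point that forwards to a fixed-dimension instantiation.
// Only out_dims 2 and 3 are handled in this sketch.
double squared_distance_dispatch(int out_dims, const double* a, const double* b) {
    switch (out_dims) {
        case 2: return squared_distance<2>(a, b);
        case 3: return squared_distance<3>(a, b);
        default: throw std::invalid_argument("unsupported out_dims");
    }
}
```

The trade-off is that every supported dimension needs its own instantiation, so a fully generic runtime `out_dims` is no longer available without a fallback path.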

lvdmaaten · 2016-06-09T01:19:58Z

Nice! In an earlier version of the code, I had hardcoded the output dimension. Indeed, that was a lot faster than the version that is currently in the repo.

iraykhel · 2016-08-01T20:42:15Z

@maximsch2
Would it be complicated to provide Windows build support for your OpenMP implementation? Thanks :)

maximsch2 · 2016-08-01T20:51:34Z

I don't have easy access to a Windows machine, but I don't think I've added anything Unix-specific there. I think MinGW supports OpenMP, so you should be able to build it using gcc just like you would on Linux.
Are you having some specific issues with Windows?

EDIT: I've just checked and apparently there is even a description of how to build it on Windows. I haven't updated the Makefile.win though, so it might be a bit broken... Can you try just building tsne_bin.cpp using Visual Studio to get a binary? Just that file, no reason to add anything else to it.

EDIT2: I've pushed a version with updated Makefile.win, which has a higher chance of working.

maximsch2 · 2016-08-01T21:38:21Z

@lvdmaaten
Answering your earlier question about speedup, I went ahead and benchmarked it on a 25000x50 dataset, on a quad-core CPU (with HT, so it presents itself as 8-core). Results:

OMP_NUM_THREADS=1
real    2m23.372s
user    2m23.269s
sys     0m0.091s

OMP_NUM_THREADS=2
real    1m49.877s
user    2m31.483s
sys     0m0.103s

OMP_NUM_THREADS=4
real    1m27.637s
user    2m34.543s
sys     0m0.139s

OMP_NUM_THREADS=8
real    1m24.935s
user    3m39.806s
sys     0m0.159s

This is the total time, including reading the file, building the tree (currently not parallelized; it takes around 25 seconds in this case), and doing 1000 iterations of embedding learning.
Speed up is not perfect, but some scaling is there.

iraykhel · 2016-08-01T22:14:14Z

Thanks, seems to be working :)

lvdmaaten · 2016-08-01T22:25:40Z

Nice!

baobabKoodaa · 2016-09-20T17:20:59Z

I was able to get this version working on Windows, but the multicore version by maximsch2 is just returning with a "non-zero return code" pretty fast. It's not giving more information even though verbose=True.

maximsch2 · 2016-09-20T18:12:52Z

@baobabKoodaa If you want, I can try running your script/data here on Linux to see if this is a Windows-specific issue.

baobabKoodaa · 2016-10-01T21:09:47Z

@maximsch2 Thank you for the idea and for the kind offer. I will try running it on a Linux machine at some point; for now I can make do with the single-core version.

kylemcdonald · 2017-03-14T01:06:18Z

Since it hasn't been mentioned yet, see https://github.com/DmitryUlyanov/Multicore-TSNE

see: lvdmaaten/bhtsne#18. These changes are inspired by https://github.com/maximsch2/bhtsne. This approach was selected as it requires minimal changes to parallelize the algorithm. In particular, these changes correspond roughly to the changes in maximsch2/bhtsne@08d8a2a

rappdw · 2017-04-03T20:32:30Z

Based on the above discussion, we've made some modifications to the code on this fork and see some significant performance improvements when running on multiple cores. (We've documented the performance tests and resultant improvement here.)

bartimus9 mentioned this issue Aug 8, 2017

Results differ from scikit-learn implementation DmitryUlyanov/Multicore-TSNE#8

Open

kahilah mentioned this issue Sep 6, 2018

Performance difference to the old version #75

Open
