
running on multiple cores? #18

Open · joernroeder opened this issue Apr 13, 2016 · 16 comments

@joernroeder

Hey,
it's more a question than an actual issue: I'm mapping a dataset of 900,000 items × 32 dimensions with t-SNE on a multi-core machine, but since t-SNE is single-threaded I'm only using one core. Do you have any tips or tricks for splitting the dataset to parallelize the computation?
Thanks in advance!

@lvdmaaten (Owner)

There are basically two important loops that should be straightforward to parallelize:

https://github.com/lvdmaaten/bhtsne/blob/master/sptree.cpp#L385
https://github.com/lvdmaaten/bhtsne/blob/master/tsne.cpp#L215

The first loop is actually embarrassingly parallel, so it should be completely trivial. The second loop may be somewhat trickier because each iteration may access the same nodes of the tree, so it is somewhat less predictable what speedups you can get there.
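
For illustration, here is a minimal OpenMP sketch of the pattern for the first, embarrassingly parallel loop. It is not the repository's actual code — the function name and the per-point work are placeholders standing in for the real force computation — but it shows the two relevant points: each iteration writes only to its own D-wide slice of the output buffer, so no locking is needed, and the shared scalar accumulator is handled with a reduction clause.

```cpp
#include <cmath>
#include <vector>

// Placeholder per-point pass: `Y` holds N points of dimensionality D, row-major.
// Each iteration touches only out[n*D .. n*D+D-1] plus a reduced scalar, so the
// outer loop can be parallelized with a single pragma. Compile with -fopenmp.
void per_point_pass(int N, int D, const std::vector<double>& Y,
                    std::vector<double>& out, double& sum_Q) {
    double local_sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:local_sum)
    for (int n = 0; n < N; ++n) {
        double acc = 0.0;
        for (int d = 0; d < D; ++d) {
            const double v = Y[n * D + d];
            out[n * D + d] += v;      // stands in for the real force terms
            acc += v * v;
        }
        local_sum += std::exp(-acc);  // stands in for the point's contribution to sum_Q
    }
    sum_Q += local_sum;
}
```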

@joernroeder (Author)

@lvdmaaten thanks for the info. I'll look a bit deeper into the loops and play around with it as soon as I find some spare time :)

@maximsch2

I have an OpenMP based version here: https://github.com/maximsch2/bhtsne. I don't like the binary file interface, so I'm also modifying it to build as a shared library and expose a simple C API.
An extra pair of eyes is always useful when writing parallel code, so if anyone else is interested, go ahead and try it out!
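
For context, here is a hypothetical sketch of what such a C API might look like — the actual interface in maximsch2's fork is not shown in this thread, so the function name and parameters below are purely illustrative. A flat extern "C" entry point lets the shared library be called from Python (ctypes/cffi), R, and so on without going through the intermediate binary files:

```cpp
// Illustrative declaration only; not the fork's real API.
extern "C" {

// `data` is a row-major N x in_dims array; `result` must have room for
// N x out_dims doubles. Returns 0 on success, non-zero on error.
int tsne_run(const double* data, int N, int in_dims,
             double* result, int out_dims,
             double perplexity, double theta, int max_iter);

}  // extern "C"
```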

@lvdmaaten (Owner)

I'm no OpenMP expert, but it looks good to me. What kind of speed-ups are you seeing compared to the non-OpenMP code?

@maximsch2

I haven't benchmarked it on big datasets yet, but I get around a 1.3-1.5x speedup on two cores on a smallish dataset (it takes a couple of seconds to build).
My main motivation was integrating it into a visualization system where I wanted to generate t-SNE embeddings on the fly. I think making it available as a library (as opposed to writing and reading files) and encoding the output dimension as a template parameter (which lets the compiler do fewer allocations and hopefully unroll the loops over dimensions, especially when out_dims = 2 or 3) gave me at least a 3-4x improvement in performance.
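
As a small illustration of the template-parameter idea (assumed, not the fork's actual code): when the output dimensionality is a compile-time constant, inner loops over dimensions have fixed trip counts, so the compiler can fully unroll them, e.g.:

```cpp
// Squared Euclidean distance between two embedding points whose dimensionality
// is fixed at compile time. With OUT_DIMS = 2 or 3 the loop is fully unrolled.
template <int OUT_DIMS>
inline double squared_distance(const double* a, const double* b) {
    double dist = 0.0;
    for (int d = 0; d < OUT_DIMS; ++d) {
        const double diff = a[d] - b[d];
        dist += diff * diff;
    }
    return dist;
}

// Usage for a 2-D embedding stored row-major in `y`:
//   double d2 = squared_distance<2>(y + i * 2, y + j * 2);
```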

@lvdmaaten (Owner)

Nice! In an earlier version of the code, I had hardcoded the output dimension. Indeed, that was a lot faster than the version that is currently in the repo.

@iraykhel commented Aug 1, 2016

@maximsch2 Would it be complicated to provide Windows build support for your OpenMP implementation? Thanks :)

@maximsch2 commented Aug 1, 2016

I don't have easy access to a Windows machine, but I don't think there is anything Unix-specific in what I've added. I think MinGW supports OpenMP, so you should be able to build it with gcc just like you would on Linux.
Are you having any specific issues on Windows?

EDIT: I've just checked, and apparently there is even a description of how to build it on Windows. I haven't updated Makefile.win though, so it might be a bit broken... Can you try building just tsne_bin.cpp with Visual Studio to get a binary? Just that file; there's no reason to add anything else.

EDIT2: I've pushed a version with updated Makefile.win, which has a higher chance of working.

@maximsch2

@lvdmaaten Answering your earlier question about speedups: I went ahead and benchmarked it on a 25000x50 dataset on a quad-core CPU (with hyper-threading, so it presents itself as 8 cores). Results:

OMP_NUM_THREADS=1
real    2m23.372s
user    2m23.269s
sys 0m0.091s

OMP_NUM_THREADS=2
real    1m49.877s
user    2m31.483s
sys 0m0.103s

OMP_NUM_THREADS=4
real    1m27.637s
user    2m34.543s
sys 0m0.139s

OMP_NUM_THREADS=8
real    1m24.935s
user    3m39.806s
sys 0m0.159s

This is the total time, including reading the file, building the tree (currently not parallelized; it takes around 25 seconds in this case), and doing 1000 iterations of embedding learning.
The speedup is not perfect, but there is some scaling.
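
A rough Amdahl-style check, assuming the ~25 s of file reading and tree building is the only serial part and the remaining ~118 s of the single-thread run parallelizes perfectly:

$$T(p) \approx 25\,\text{s} + \frac{143\,\text{s} - 25\,\text{s}}{p} \quad\Rightarrow\quad T(4) \approx 54.5\,\text{s}$$

The observed 87.6 s at four threads is well above that, which is consistent with lvdmaaten's note above that the tree-traversal loop is harder to speed up predictably.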

@iraykhel commented Aug 1, 2016

Thanks, seems to be working :)

@lvdmaaten (Owner)

Nice!

@baobabKoodaa

I was able to get this version working on Windows, but the multicore version by maximsch2 just returns with a "non-zero return code" fairly quickly. It doesn't give any more information even with verbose=True.

@maximsch2

@baobabKoodaa If you want, I can try running your script/data here on Linux to see if this is a Windows-specific issue.

@baobabKoodaa

@maximsch2 Thank you for the idea and the kind offer. I will try running it on a Linux machine at some point; for now I can make do with the single-core version.

@kylemcdonald

Since it hasn't been mentioned yet, see https://github.com/DmitryUlyanov/Multicore-TSNE

rappdw pushed a commit to resero-labs/tsne that referenced this issue on Mar 30, 2017:

see: lvdmaaten/bhtsne#18. These changes are inspired by https://github.com/maximsch2/bhtsne. This approach was selected as it requires minimal changes to parallelize the algorithm. In particular, these changes correspond roughly to the changes in maximsch2/bhtsne@08d8a2a.

@rappdw commented Apr 3, 2017

Based on the discussion above, we've made some modifications to the code on this fork and see significant performance improvements when running on multiple cores. (We've documented the performance tests and the resulting improvement here.)
