
inline a barebones version of logsumexp for improved performance #1745

Merged: 3 commits into piskvorky:develop on Nov 30, 2017

Conversation

@arlenk (Contributor) commented Nov 29, 2017

logsumexp accounts for 50% of the run time of ldamodel. Much
of this time is spent on "robustness" checks performed by
scipy's logsumexp (e.g., _asarray_validated, checks for NaNs, etc.).

Removing these checks greatly improves the overall performance
of ldamodel. E.g., run time when fitting an LDA model on the
Enron dataset (from UCI) decreases by 20-40%.
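
As a rough illustration (not the benchmark behind the numbers above; the vector size and iteration count are assumptions), the difference can be seen with a micro-benchmark along these lines:

import timeit

import numpy as np
from scipy.special import logsumexp as scipy_logsumexp

# Illustrative input; the size is an assumption, not the setup used for this PR.
x = np.random.rand(100).astype(np.float32)

print("scipy  :", timeit.timeit(lambda: scipy_logsumexp(x), number=100000))
print("inline :", timeit.timeit(
    lambda: np.log(np.sum(np.exp(x - x.max()))) + x.max(), number=100000))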

@menshikh-iv (Contributor) commented Nov 29, 2017

Hi @arlenk, can you attach the cProfile output from before and after the change, plus the code you used for profiling?
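
For reference, a minimal sketch of the kind of profiling driver being asked for here (the toy corpus, model parameters, and output filename are illustrative assumptions, not the script that was eventually attached):

import cProfile
import pstats

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Tiny toy corpus purely for illustration; a real run would use something
# larger, e.g. the UCI Enron bag-of-words corpus mentioned above.
texts = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["graph", "minors", "trees", "survey"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Profile a full fit and print the hottest calls by cumulative time.
cProfile.runctx(
    "LdaModel(corpus, id2word=dictionary, num_topics=10, passes=20)",
    globals(), locals(), filename="lda_profile.out")
pstats.Stats("lda_profile.out").sort_stats("cumulative").print_stats(20)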


Code under review (the excerpt on the page is truncated; the function body below is a reconstruction of the standard max-shift implementation):

import logging
import numpy as np

logger = logging.getLogger('gensim.models.ldamodel')


def logsumexp(x):
    """Barebones log-sum-exp that tries to avoid overflows."""
    # Shift by the max so np.exp cannot overflow, then undo the shift.
    x_max = x.max()
    return np.log(np.sum(np.exp(x - x_max))) + x_max

@menshikh-iv mentioned this pull request Nov 29, 2017
@arlenk (Contributor, Author) commented Nov 30, 2017

I can't seem to attach a .py file, so here's a zip of the profiling code and the cProfile output. The speedup varies with the number of passes and other parameters; on my computer, I see speedups of 10-30% on average. I'm curious whether that's specific to my configuration or not.

One other benefit is that this speeds up ldamulticore and ldaseqmodel as well.

profiling.zip

@menshikh-iv (Contributor) commented Nov 30, 2017

I see only a 5% speedup (this method doesn't consume 50% of the time; the slowest method is dirichlet_expectation, at around 40% of the total), but that's a good optimization too 👍

@arlenk congrats on your first PR!

@menshikh-iv merged commit 08cfa93 into piskvorky:develop Nov 30, 2017
@piskvorky (Owner) commented Nov 30, 2017

Rewriting the whole LDA inner loop in C (Cython) would be super cool.

It should also be quite easy to do (unless I'm missing something). I expect the NumPy overhead is massive, with all the checks, dynamic allocations, Python overhead...

@arlenk deleted the inline-logsumexp branch on Dec 1, 2017
@arlenk (Contributor, Author) commented Dec 5, 2017

Rewriting the whole LDA inner loop in C (Cython) would be super cool.

I took a stab at moving more of the hotspots into Cython in the branch below. From some profiling, most of the time is spent in the following few spots (the first two are sketched just after this list):

  • dirichlet_expectation (specifically the call to psi())
  • calculating the mean absolute difference to determine when gamma has converged
  • logsumexp (even with the inline version)
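
For context, a rough pure-NumPy sketch of the first two hotspots (paraphrasing gensim's dirichlet_expectation and the gamma convergence check, not necessarily the exact code in the tree):

import numpy as np
from scipy.special import psi  # digamma


def dirichlet_expectation_1d(alpha):
    # E[log theta | alpha] for a Dirichlet; the per-element psi() calls are
    # what dominate the inner loop (sketch, not gensim's exact implementation).
    return psi(alpha) - psi(np.sum(alpha))


def mean_absolute_difference(a, b):
    # Convergence check on gamma between inference iterations.
    return np.mean(np.abs(a - b))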

I broke these functions out into a Cython module. For now, I just created a fastldamodel.py module (a copy of ldamodel.py, but using the Cython functions above) to see if there is any interest in pursuing this. Based on some profiling on my PC (attached), the speedup is about 100% (meaning the new code takes half the time), though this depends on the number of topics, number of passes, etc.

I also noticed that the current ldamodel code defaults to float32, but the user can change this if desired. If we go down the Cython route, we'd have to decide whether to keep letting the user change the datatype, as this would (I believe) require separate Cython functions for each data type. Alternatively, we could use the Cython code when the datatype is float32 (and perhaps float64) and fall back to the slower versions otherwise.
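
A compact sketch of that fallback idea (the kernel registry and names here are hypothetical, not gensim APIs):

import numpy as np
from scipy.special import psi

# Hypothetical registry: a Cython build would register typed kernels here,
# e.g. {np.float32: _dirichlet_expectation_f32}; it is empty in this sketch.
_FAST_KERNELS = {}


def dirichlet_expectation(alpha):
    # Use a compiled kernel when one exists for this dtype, otherwise fall
    # back to the slower generic NumPy implementation.
    kernel = _FAST_KERNELS.get(alpha.dtype.type)
    if kernel is not None:
        return kernel(alpha)
    return (psi(alpha) - psi(np.sum(alpha))).astype(alpha.dtype)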

https://github.com/arlenk/gensim/tree/cython

profiling_results.zip

@piskvorky (Owner):

Awesome!

Float32 is the default; that should be our focus (at least to start with). Double precision is not that important.

@menshikh-iv (Contributor):

@arlenk can you create a PR with your Cython optimization? That would be great!

@arlenk (Contributor, Author) commented Dec 7, 2017

Moved to a separate pull request:

#1767
