
Add cython version for "hot" functions from gensim.models.LdaModel #1767

Merged
merged 39 commits into from
Feb 23, 2018

Conversation

arlenk
Contributor

@arlenk arlenk commented Dec 7, 2017

Attempt at moving some of the hotspots for ldamodel into a cython module. From some profiling, most of the time in ldamodel.LdaModel is spent in the following few spots:

dirichlet_expectation (specifically the call to psi())
calculating the mean absolute difference to determine when gamma has converged
logsumexp (even with the inline version)
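For orientation, the logsumexp hotspot is the standard numerically stable reduction; a minimal pure-python sketch of the max-subtraction trick (this is an illustration, not the PR's cython code) looks like:

```python
import math

def logsumexp(x):
    # log(sum(exp(x))) computed stably: subtracting the max keeps every
    # exp() argument <= 0, so nothing overflows
    m = max(x)
    return m + math.log(sum(math.exp(v - m) for v in x))
```

With the max pulled out, `logsumexp([1000.0, 1000.0])` returns `1000 + log 2` instead of overflowing; the cython version presumably wins by avoiding per-call numpy overhead on tiny arrays, not by changing this math.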

I broke these functions out into a cython module. For now, I just created a fastldamodel.py module (a copy of ldamodel.py that uses the cython functions above) to see if there is any interest in pursuing this. Based on some profiling on my PC (attached), the speedup is about 100% (meaning the new code takes half the time, though this depends on the number of topics, number of passes, etc.).

I also noticed that the current ldamodel code uses float32 datatypes, but the user can change this if desired. If we go down the cython route, we'd have to decide if we want to allow the user to change the datatype as this would require different cython functions for each data type (I believe). Alternatively, we could use the cython code when the datatype is float32 (and perhaps float64) and fall back to the slower versions otherwise.

@arlenk arlenk changed the title partially cython-ized version of ldamodel [WIP] partially cython-ized version of ldamodel Dec 7, 2017
@arlenk
Contributor Author

arlenk commented Dec 13, 2017

FYI, I reworked this change to mirror the implementation of word2vec_inner and doc2vec_inner.

The cython functions now live in ldamodel_inner.pyx, and LdaModel tries to use the fast cython versions if possible, falling back to the "slower" alternatives if it can't. I also explicitly added separate cython functions for float32 and float64, so the faster code should work with both of these dtypes now (I don't know how to handle float16 nicely in the cython code, so for now that falls back to the slower code as well).

I used cython's fused datatypes to make the cython code a little nicer, but the code still has a lot of duplication in it. I'm not familiar with other templating methods in cython, but I'm sure there must be a way to make the functions generic without requiring so much duplication.
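For reference, a fused-type kernel looks roughly like this (an illustrative sketch under the PR's stated approach, not its actual code); cython emits one C specialization per type listed in the fused type, and for `def` functions it also generates a small dispatcher keyed on the buffer dtype:

```cython
cimport cython
from libc.math cimport fabs

ctypedef fused real_t:
    float
    double

@cython.boundscheck(False)
@cython.wraparound(False)
def mean_absolute_difference(real_t[:] a, real_t[:] b):
    # one loop body, compiled separately for float32 and float64 views
    cdef Py_ssize_t i
    cdef double total = 0.0
    for i in range(a.shape[0]):
        total += fabs(a[i] - b[i])
    return total / a.shape[0]
```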

The _attach_inner_methods() function in LdaModel tries to pick up the right methods to use based on the availability of ldamodel_inner and the dtype. There's probably a nicer way of doing this as well.

I haven't checked all the edge cases yet, but my simple tests so far seem to work.

This implementation has the benefit of allowing the multicore and distributed versions to use the faster cython code as well.

I also left in the older "fastldamodel.py" files just to make comparisons between the two implementations easier.

@arlenk arlenk closed this Dec 13, 2017
@arlenk arlenk reopened this Dec 13, 2017
Contributor

@menshikh-iv menshikh-iv left a comment


Great work @arlenk!

Can you show benchmark results please (LdaModel/LdaMulticore with the numpy/scipy functions VS LdaModel/LdaMulticore with the cythonized functions)?

Also, please add tests for your cython functions (we need to be confident that everything works correctly).

@@ -42,7 +42,7 @@
from collections import defaultdict

from gensim import interfaces, utils, matutils
from gensim.matutils import dirichlet_expectation
from gensim.matutils import dirichlet_expectation as dirichlet_expectation
Contributor


Why import X as X?


try:
# try to load fast, cythonized code if possible
from gensim.models.ldamodel_inner import dirichlet_expectation_1d_f32, dirichlet_expectation_1d_f64
Contributor


ping @piskvorky: strict typing improves performance, but not by much. Wdyt about a distinct function for each type?

Owner


Up to you. I'd be mindful of the added complexity -- what sort of performance difference are we talking about?

Contributor


@arlenk can you compare these variants?

Contributor Author


It will take me a few days, but I will include some benchmark results.

The typed versions (_f32 and _f64) are not entirely for performance reasons, though. The C code needs to be different for floats vs. doubles. I used cython's fused types to remove some of the duplication, but I don't think there's a way to have python call a single function that will dispatch (statically) to the correct float vs. double cython function. We could check the dtypes at run time, but that imposes a pretty large penalty given how often these functions are called (e.g., dirichlet_expectation is called millions of times even for a small corpus).

I can include a benchmark with the run-time dtype check to see how big the impact is though.

try:
# try to load fast, cythonized code if possible
from gensim.models.ldamodel_inner import dirichlet_expectation_1d_f32, dirichlet_expectation_1d_f64
from gensim.models.ldamodel_inner import dirichlet_expectation_2d_f32, dirichlet_expectation_2d_f64
Contributor


You can import all this stuff with one import using line breaks (no need to repeat `from X import ...` several times).

"""
return np.mean(np.abs(a - b))

try:
Contributor


It's better to move this section to the head of the file (after the import section).

Contributor Author

@arlenk arlenk Dec 14, 2017


Not sure how to restructure this yet. The except clause needs to happen after the "slow" versions of logsumexp and mean_absolute_difference are defined, so that's why the code is where it is right now. I initially included the definition of the slow versions in the except block as well (mirroring what word2vec does), but that breaks if anyone uses the float16 dtype, which requires the slow functions even when the cython versions exist (since the cython versions can only handle float32 and float64).

It might make things easier to just move the "inner" functions (logsumexp, dirichlet_expectation, and mean_absolute_difference) into their own python module and have that module import the cython versions if available.

Open to any other suggestions as well.
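The module-level pattern being discussed can be sketched like this (the fast-module name is hypothetical); the slow definitions come first, so an ImportError simply leaves them in place:

```python
def mean_absolute_difference(a, b):
    # slow, pure-python fallback (the real code uses numpy)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

try:
    # shadow the fallback with a compiled version when available;
    # "_hypothetical_fast_inner" stands in for the real cython module name
    from _hypothetical_fast_inner import mean_absolute_difference  # noqa: F401
except ImportError:
    pass
```

Callers always import from this one module and never need to know which implementation they got.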

Contributor


Ok, let's leave it as is for now.

mean_absolute_difference_f32 = mean_absolute_difference_f64 = mean_absolute_difference


def _attach_inner_methods(self, dtype):
Contributor


It's probably better to define the needed functions at module level (not "patch" LdaModel at runtime); this avoids backward-compatibility problems.

Contributor Author


That was my original intention (and that's what the original "fastldamodel.py" code did), but I am not sure how to make this work nicely with LdaModel, since a user could create multiple models with different dtypes. E.g.,

model0 = LdaModel(corpus=corpus, id2word=corpus.id2word, num_topics=10, dtype=np.float16)
model1 = LdaModel(corpus=corpus, id2word=corpus.id2word, num_topics=10, dtype=np.float32)
model2 = LdaModel(corpus=corpus, id2word=corpus.id2word, num_topics=10, dtype=np.float64)

Each of those objects would need to use different implementations of the inner functions (unless we dispatched at run time based on dtypes, but as mentioned before, I think that would have a sizeable impact on speed).
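One way to make per-model dtypes work is to resolve the kernel once at construction time, so the hot loops never repeat the dtype check; a toy sketch of that idea (all names here are hypothetical, not the PR's):

```python
def _kernel_f32(x):
    return ("f32", x)      # stand-in for the float32 cython kernel

def _kernel_f64(x):
    return ("f64", x)      # stand-in for the float64 cython kernel

def _kernel_slow(x):
    return ("slow", x)     # generic fallback, e.g. for float16

_KERNELS = {"float32": _kernel_f32, "float64": _kernel_f64}

class ToyModel:
    def __init__(self, dtype_name):
        # resolved once per instance; hot loops call self._kernel directly,
        # so the dtype lookup cost is paid a single time
        self._kernel = _KERNELS.get(dtype_name, _kernel_slow)
```

Each model instance then carries the right implementation for its own dtype, which is roughly what `_attach_inner_methods()` does.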

Contributor


Agreed, you depend on the dtype of the LDA model.

Contributor


Aha, you are right: dtype is a really fresh feature (for LDA), I forgot about it.
Ok, this variant is probably fine (another option: add this as a method of LDA and call it in __init__).

Owner

@piskvorky piskvorky Dec 14, 2017


I believe we should decide what dtype makes the most sense, and stick with it (drop support for all other dtypes).

It's not a parameter we can expose and ask users to choose for themselves (especially since we give no guidance, no pros/cons/recommendations about the implications of this choice).

Contributor

@menshikh-iv menshikh-iv Dec 14, 2017


@piskvorky probably float32 is the best choice (in terms of speed, memory consumption and quality).

Owner

@piskvorky piskvorky Dec 14, 2017


I agree, but would like to see some numeric arguments (benchmarks, model quality & performance evals) to back that up.

Unless the other dtypes offer some dramatically different tradeoffs, I see no reason to complicate the code and support them.

Contributor Author


removing the other dtypes would certainly make this change easier.

I was running some tests with the different dtypes, and float16 (one of the allowed options in the existing code) seems to over/underflow on some reasonable input data. For example, the following code will crash with an overflow error:

import numpy as np
import gensim
import gensim.downloader as api
from gensim.models import LdaModel

np.seterr(all='raise')

text = api.load('text8')
id2word = gensim.corpora.Dictionary(text)
corpus = [id2word.doc2bow(t) for t in text]

LdaModel(corpus=corpus, id2word=id2word, num_topics=10, dtype=np.float16)

@@ -0,0 +1,256 @@
from __future__ import division
Contributor


Is this file still needed (after the last changes, where you moved all the "slow" functions to cython)?

Contributor

@menshikh-iv menshikh-iv Dec 13, 2017


Also, the .pyd file (Windows-compiled) should not be here (only the .pyx and .c files).

Contributor Author

@arlenk arlenk Dec 14, 2017


No, it's not needed. I left it in just for comparison, but I've removed it now.

setup.py Outdated
@@ -13,7 +13,7 @@
import os
import sys
import warnings

import numpy as np
Contributor


That's really weird; please don't use numpy in setup.py (this is the reason all the CI builds are broken).

Contributor Author


Ok, removed. I didn't realize that you could include the numpy header files for the cython code without importing numpy directly.

@arlenk
Contributor Author

arlenk commented Dec 18, 2017

Can you show benchmark results please (LdaModel/LdaMulticore with the numpy/scipy functions VS LdaModel/LdaMulticore with the cythonized functions)?

Here are the results for a couple of different settings (only using LdaModel for now, not LdaMulticore):
https://gist.github.com/arlenk/ea745c1d05e323d87df96c0e6b490d05

The cython code takes around 60% as long as the current code.

Here's the code used for the benchmarks:
https://gist.github.com/arlenk/ec02cce1fe571b4c05ef3853c49289a3

@menshikh-iv
Contributor

menshikh-iv commented Dec 18, 2017

@arlenk your results look great! Thanks for benchmarking (but please post the results as a table in this PR, for the record).

Can you make the same benchmark for the different dtype cases?

Also, don't forget about tests (these are still missing).

@menshikh-iv
Contributor

menshikh-iv commented Jan 8, 2018

@arlenk big thanks for the concrete example 👍, I reproduced this problem with python2 on Linux.
@piskvorky apparently we should use float64 here (because of the underflow with float32).

Stacktrace

FloatingPointError                        Traceback (most recent call last)
<ipython-input-1-e77d33ef3be3> in <module>()
     12 corpus = [ id2word.doc2bow(t) for t in text ]
     13 
---> 14 LdaModel(corpus=corpus, id2word=id2word, num_topics=100, dtype=np.float32)

/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/ldamodel.pyc in __init__(self, corpus, num_topics, id2word, distributed, chunksize, passes, update_every, alpha, eta, decay, offset, eval_every, iterations, gamma_threshold, minimum_probability, random_state, ns_conf, minimum_phi_value, per_word_topics, callbacks, dtype)
    396         if corpus is not None:
    397             use_numpy = self.dispatcher is not None
--> 398             self.update(corpus, chunks_as_numpy=use_numpy)
    399 
    400     def init_dir_prior(self, prior, name):

/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/ldamodel.pyc in update(self, corpus, chunksize, decay, offset, passes, update_every, eval_every, iterations, gamma_threshold, chunks_as_numpy)
    730 
    731                 if eval_every and ((reallen == lencorpus) or ((chunk_no + 1) % (eval_every * self.numworkers) == 0)):
--> 732                     self.log_perplexity(chunk, total_docs=lencorpus)
    733 
    734                 if self.dispatcher:

/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/ldamodel.pyc in log_perplexity(self, chunk, total_docs)
    604         corpus_words = sum(cnt for document in chunk for _, cnt in document)
    605         subsample_ratio = 1.0 * total_docs / len(chunk)
--> 606         perwordbound = self.bound(chunk, subsample_ratio=subsample_ratio) / (subsample_ratio * corpus_words)
    607         logger.info(
    608             "%.3f per-word bound, %.1f perplexity estimate based on a held-out corpus of %i documents with %i words",

/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/ldamodel.pyc in bound(self, corpus, gamma, subsample_ratio)
    842                 logger.debug("bound: at document #%i", d)
    843             if gamma is None:
--> 844                 gammad, _ = self.inference([doc])
    845             else:
    846                 gammad = gamma[d]

/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/ldamodel.pyc in inference(self, chunk, collect_sstats)
    522                 gammad = self.alpha + expElogthetad * np.dot(cts / phinorm, expElogbetad.T)
    523                 Elogthetad = dirichlet_expectation(gammad)
--> 524                 expElogthetad = np.exp(Elogthetad)
    525                 phinorm = np.dot(expElogthetad, expElogbetad) + eps
    526                 # If gamma hasn't changed much, we're done.

FloatingPointError: underflow encountered in exp

@menshikh-iv
Contributor

@arlenk generally, we need to investigate why the underflow happens, fix it, and keep only this data type. Thanks for your great work!

Do you have any ideas about why the underflow problem happens? Possibly this is a flaw in the algorithm's design, but I have no idea how to debug this.

@arlenk
Contributor Author

arlenk commented Jan 15, 2018

@arlenk generally, we need to investigate why the underflow happens, fix it, and keep only this data type. Thanks for your great work!

Do you have any ideas about why the underflow problem happens? Possibly this is a flaw in the algorithm's design, but I have no idea how to debug this.

I don't think it's an algorithm issue, but I will try digging into this some more.

In this particular case, I don't think it's necessarily an issue. The underflowed value is later used in a dot product, so the term involving it is going to be ~zero in any case. If all the values in the dot product underflowed, that would be a problem, but I don't think having a few underflows will break the algorithm.

Having said that, it still may make sense to default to float64 to avoid underflows (I have not run into any yet with float64 dtype). Float32 does save memory, of course, but that could still be left as an option for anyone who is really running low on memory.
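The "harmless underflow" argument can be seen with plain float64 arithmetic. Numpy only raised an exception above because np.seterr(all='raise') was set; the default IEEE behavior silently flushes the value to zero, which is essentially what the dot product needs anyway:

```python
import math

tiny = math.exp(-800.0)          # underflows quietly to exactly 0.0

# the underflowed term contributes 0.0 to the dot product, which is
# what a true value of ~exp(-800) would round to anyway
weights = [0.3, 0.7]
probs = [tiny, math.exp(-1.0)]
dot = sum(w * p for w, p in zip(weights, probs))
```

Only if every term underflowed (so the whole sum collapsed to zero) would the downstream normalization be in trouble.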

@menshikh-iv
Contributor

Thank you @arlenk!
@piskvorky please tell us where to "dig" further; we have no ideas.

Contributor

@menshikh-iv menshikh-iv left a comment


Great job @arlenk 🔥

setup.py Outdated
@@ -258,7 +257,10 @@ def finalize_options(self):
include_dirs=[model_dir]),
Extension('gensim.models.fasttext_inner',
sources=['./gensim/models/fasttext_inner.c'],
include_dirs=[model_dir])
include_dirs=[model_dir]),
Extension('gensim.models.ldamodel_inner',
Contributor


Don't forget to modify MANIFEST.in in the root of the repo (add your new files).

setup.py Outdated
include_dirs=[model_dir])
include_dirs=[model_dir]),
Extension('gensim.models.ldamodel_inner',
sources=['./gensim/models/ldamodel_inner.c'],
Contributor


Maybe a better name for it would be _matutils (because this isn't a model at all, only concrete math functions)?

Contributor


Also, maybe move this file to the same level as matutils.py? Wdyt?

@@ -56,32 +55,26 @@
np.float64: 1e-100,
}

try:
Contributor


We can move this (the try/except section) to matutils too.

@menshikh-iv
Contributor

@arlenk please resolve the merge conflict & make the suggested fixes.

I don't think we'll fix the underflow problem soon, so what prevents us from merging this change now?

@arlenk
Contributor Author

arlenk commented Feb 23, 2018

@arlenk please resolve the merge conflict & make the suggested fixes.

I don't think we'll fix the underflow problem soon, so what prevents us from merging this change now?

Ok, I made the suggested changes.

I also removed the use of dirichlet_expectation_1d and _2d and instead just added a cython version of dirichlet_expectation() that calls the _1d or _2d variant based on the input shape. It turns out, after some profiling, that the added ndim check adds less than a few percent of overhead, and having a single dirichlet_expectation function simplifies the code.

I think this also means that hdpmodel and atmodel should benefit from the changes as well (as they call matutils.dirichlet_expectation).

@menshikh-iv
Contributor

@arlenk can you fix the tests please (you forgot to update the tests according to the _1d/_2d removal)?

@menshikh-iv
Contributor

@arlenk UPD: I'll do it myself, don't worry (I need to fix the docstrings anyway).

@menshikh-iv menshikh-iv changed the title [WIP] partially cython-ized version of ldamodel Add cython version for "hot" functions from gensim.models.LdaModel Feb 23, 2018
@menshikh-iv
Contributor

@arlenk beautiful! You made a very neat and important change, which seriously improves the performance of LDA (and of related models that use LDA as the base model). Both of your PRs (MmReader and this one) will be our main features in the next release 💥

Would you like to continue cythonizing gensim? We would be very happy if you have the time and the inclination ❤️.

@menshikh-iv menshikh-iv merged commit a54e336 into piskvorky:develop Feb 23, 2018
@piskvorky
Owner

Awesome job! Thanks so much @arlenk .

@arlenk arlenk deleted the cython branch February 26, 2018 01:34
@arlenk
Contributor Author

arlenk commented Feb 26, 2018

@arlenk beautiful! You made a very neat and important change, which seriously improves the performance of LDA (and of related models that use LDA as the base model). Both of your PRs (MmReader and this one) will be our main features in the next release 💥

Would you like to continue cythonizing gensim? We would be very happy if you have the time and the inclination ❤️.

Good to hear. The LDA model is what I am most familiar with, but if there are other bottlenecks that have come up in the past, let me know and I am happy to take a look.

@menshikh-iv
Contributor

@arlenk I have several optimization tasks:

  • (simple) New binary corpus format #1697. Idea: add one more corpus class in a simple binary format. It should be fast to read/write (reading speed is more important here) and compact on disk.
  • (hard) Multi-stream API. This one is really big, but it's a bottleneck for many models (especially the *2vec family); look at the related issues (mentioned in the description) to understand what happens here.

The last task needs a really long time, but the first looks simple and should not require as much.
