Redundant get_Elogbeta calls in LdaModel #2051
From a quick look, I think you are right. Out of curiosity, what version of gensim are you using, and what parameters are you passing to LdaModel? I'm asking because in my use cases get_Elogbeta takes about 5% of the total run time, which isn't negligible, but it seems to be taking up a much larger share of the total run time for you (if I'm reading your profile chart correctly).
I forget which version of gensim I was on when I did the performance profiling, but it was a 3.x release; I'm currently on 3.4. I believe the impact will depend on the number of topics and the number of terms (vocabulary size); the numbers here are from my test case of 1.3 million documents with 55k unique terms and 16 topics.

Here is a patch that monkey-patches the LdaModel object. I have verified that self.state.get_Elogbeta() is stateless, so the second call in the original code returns the same value as the first. With this patch, the corpus needs to be passed via update() rather than the constructor:

```python
import logging
import types

import numpy as np

# the patched methods below live in this module, so they need their own logger
logger = logging.getLogger(__name__)

lda = LdaModel(id2word=master_dictionary, num_topics=num_of_topics, passes=1)


def sync_state(self, elogbeta=None):
    # change start: accept a precomputed Elogbeta so it isn't recomputed
    if elogbeta is None:
        elogbeta = self.state.get_Elogbeta()
    # change end
    self.expElogbeta = np.exp(elogbeta)
    assert self.expElogbeta.dtype == self.dtype


def do_mstep(self, rho, other, extra_pass=False):
    """
    M step: use linear interpolation between the existing topics and
    collected sufficient statistics in `other` to update the topics.
    """
    logger.debug("updating topics")
    # update self with the new blend; also keep track of how much did
    # the topics change through this update, to assess convergence
    diff = np.log(self.expElogbeta)
    self.state.blend(rho, other)
    elogbeta = self.state.get_Elogbeta()
    # change start: compute Elogbeta once and hand it to sync_state
    diff -= elogbeta
    self.sync_state(elogbeta)
    # change end

    # print out some debug info at the end of each EM iteration
    self.print_topics(5)
    logger.info("topic diff=%f, rho=%f", np.mean(np.abs(diff)), rho)

    if self.optimize_eta:
        self.update_eta(self.state.get_lambda(), rho)

    if not extra_pass:
        # only update if this isn't an additional pass
        self.num_updates += other.numdocs


# bind the patched methods to this LdaModel instance
lda.sync_state = types.MethodType(sync_state, lda)
lda.do_mstep = types.MethodType(do_mstep, lda)

lda.update(corpus=corpus)
```
In reviewing performance profiles, I'm finding that get_Elogbeta is taking a noticeable amount of time.
I notice that there are two calls to this function, one right after the other, in LdaModel.do_mstep.
The first call is obvious; in LdaModel.do_mstep we have:
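Roughly, from the 3.x source (abridged to the relevant lines):

```python
# inside LdaModel.do_mstep (abridged)
diff = np.log(self.expElogbeta)
self.state.blend(rho, other)
diff -= self.state.get_Elogbeta()  # first call
self.sync_state()                  # hides the second call, see below
```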
But do you see the second call? In sync_state we have:
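Again, roughly from the 3.x source:

```python
def sync_state(self):
    # second call, recomputing exactly the value obtained just above in do_mstep
    self.expElogbeta = np.exp(self.state.get_Elogbeta())
    assert self.expElogbeta.dtype == self.dtype
```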
The question then becomes: does get_Elogbeta perform some side effect, such that the second call could return something different from the first?
Looking at get_Elogbeta:
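As far as I can tell it is just a thin wrapper (paraphrasing the 3.x source):

```python
def get_Elogbeta(self):
    # expected log topic-word distribution, derived from the current lambda
    return dirichlet_expectation(self.get_lambda())
```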
This function doesn't modify state directly.
Looking at the dirichlet_expectation Python code (easier to read than the Cython version):
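Paraphrasing from gensim.matutils:

```python
from scipy.special import psi  # the digamma function


def dirichlet_expectation(alpha):
    """For a vector theta ~ Dir(alpha), compute E[log(theta)] given alpha."""
    if len(alpha.shape) == 1:
        result = psi(alpha) - psi(np.sum(alpha))
    else:
        result = psi(alpha) - psi(np.sum(alpha, 1))[:, np.newaxis]
    return result.astype(alpha.dtype)  # keep the same precision as the input
```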
This doesn't modify the input alpha, and the function psi is from scipy.special.
Thus it appears that the second call to get_Elogbeta duplicates the work of the first and can be eliminated.
Is this analysis correct?
Here is a profile chart showing the timing.
This is taken from the overall LdaModel init & update, where get_Elogbeta is the two top boxes on the far right, over LDA'ing a 1.3M-document corpus.