gensim.models.LDAmodel producing NaN & same words in each topic #2115
Comments
I also encountered this issue. I believe it is a bug introduced in a recent version of gensim. Downgrading to gensim 3.1.0 solved the problem for me.
Hello all, I have also encountered the same problem; I am using gensim 3.5.0. I was using 30 passes in training, but when I changed the passes to 1, I got normal topics, with only the divide-by-zero warning. Best regards
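The divide-by-zero warning mentioned above can be promoted to an exception, which helps pinpoint exactly where the first non-finite value appears during training. A minimal numpy-only sketch of this debugging aid (not part of the original comment, and not a fix):

```python
import numpy as np

# Promote numpy's "divide by zero" warning to an exception, so the first
# log(0) raises a FloatingPointError instead of silently producing -inf:
old_settings = np.seterr(divide="raise")
try:
    np.log(np.array([0.0]))
except FloatingPointError as err:
    print("caught:", err)
finally:
    np.seterr(**old_settings)  # restore the previous error handling
```

Running training under this setting turns the silent warning into a traceback at the offending line.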
Thanks for the report @czhao028. Can you please share a small reproducible example? Right now I can't run your code, because it isn't complete (the data is missing).
I can't give you a sample of my data because it's confidential, but phrases_and_tokens["Phrases"] is a pd.Series object containing rows of keywords created by this portion of the code:
After reviewing the tokenize method, it's outdated, so I've included the most recent version below, where token_helper is essentially the first line of code I mentioned earlier in this comment.
@czhao028 That's really sad, because if we can't reproduce an issue, we have no chance of fixing it. Can you try to reproduce this error with a publicly available dataset, please?
@RaniemAR you wanna jump in here? |
@czhao028 ping |
I have the same problem. For me, as soon as I try 70 or more topics, I only get NaNs. I tried it on two different computers, and I tried a lot of combinations of gensim and numpy versions. Nothing helped.
Hi @snollygoster123123, can you give more information (exact code with dataset, Python/OS/gensim versions)? We need to reproduce this issue first.
Hi @menshikh-iv, I solved the problem by switching to the single-core LDA model. My dataset consists of 60k documents (each approximately as long as a Wikipedia article).

```python
lda = gensim.models.ldamulticore.LdaMulticore(
    corpus, id2word=dictionary, num_topics=80, chunksize=1800,
    passes=20, workers=1, eval_every=1, iterations=1000)
```

I think my post is wrong here in this issue, because the OP is using single core. If you want to, you can delete my post or move it.
@snollygoster123123 can you share
@menshikh-iv The corpus is 170MB; I uploaded it the only way I could.
@menshikh-iv If downgrading to gensim 3.1.0 helps, as @groceryheist says, it must be an issue with the recent additions. IIRC, there was a PR that reimplemented parts of LDA in C/Cython, right? Maybe it used the wrong precision (floats, i.e. single precision)? If the error is due to such numeric issues, it's possible it only manifests on larger datasets, which is why our unit tests didn't catch it.
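The single-precision hypothesis is easy to demonstrate in isolation. A small numpy-only sketch (not gensim code) showing how a log-probability that survives in float64 underflows to exactly zero in float32 and then turns into -inf under log:

```python
import numpy as np

# exp of a large negative log-probability: nonzero in double precision,
# but below float32's smallest subnormal (~1.4e-45), so it underflows to 0.0
x64 = np.exp(np.float64(-110.0))
x32 = np.exp(np.float32(-110.0))
print(x64)  # tiny but strictly positive
print(x32)  # 0.0

with np.errstate(divide="ignore"):
    print(np.log(x32))  # -inf, which can then propagate through later steps
```

This is consistent with the bug only showing up on larger datasets: per-word probabilities shrink as vocabulary and topic counts grow, so only big models push values past the float32 underflow threshold.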
If you want to reproduce the error: I get it when I run the code here: ...and remove: There's a link on that page to the GitHub repo that has the code and corpus.
@piskvorky I think you are right; this is definitely a numeric issue. It could also be related to #1927.
@csmyth76 @snollygoster123123 @RaniemAR @czhao028 any appetite for looking into this and fixing the numerical bug? I don't know when we'll get to this ourselves, so help would be welcome. It looks like a serious issue with a potentially simple fix.
I just got the same problem, out of the blue, running gensim version 3.4.0.
I have run the same task without problems on slightly different versions of the corpus. I do not really understand the code enough to be of much help, but would a simple
Apparently expElogbeta is set to
This closed issue appears to be about the same problem and may contain relevant information: #217
OK, I checked, and in my case there are many values in
Now, I do not know exactly what the implications should be if some sstats are zero, but I think they should definitely not influence the topic diff like this, and maybe also not the other locations where we get +/-Inf or NaN because of those zeros. The code appears to alternate between calculating the exp and log (which is -Inf at 0) quite frequently, plus the digamma function for the Dirichlet expectation (which also diverges to -Inf as its argument approaches 0). Maybe there is a strategy to correctly handle the calculation of these functions for values which ultimately come from those zero counts?
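The exp/log alternation described above can be reproduced with plain numpy. Once a zero enters the statistics, its log becomes -inf, and any subsequent subtraction of two such infinities yields NaN (a sketch of the failure mode, not gensim's actual code path):

```python
import numpy as np

sstats = np.array([2.0, 0.0, 3.0])  # a zero sneaks into the sufficient statistics

with np.errstate(divide="ignore", invalid="ignore"):
    logvals = np.log(sstats)      # log(0) -> -inf
    print(logvals)                # [ 0.693..., -inf, 1.098...]
    print(np.exp(logvals))        # round-trips cleanly back to [2., 0., 3.]
    # but combining two -inf terms, e.g. when normalizing in log space
    # against a total that is itself -inf, is undefined:
    print(logvals[1] - logvals[1])  # -inf - (-inf) -> nan
```

So exp/log round trips are harmless on their own; the NaNs appear only when two infinite terms meet in an arithmetic operation, which matches "all topics become NaN at once."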
I found a solution to this problem: the eps is too small. When I increase it, the NaNs disappear. Create a file:

```python
from gensim.models.ldamodel import *

DTYPE_TO_EPS = {
    np.float16: 1e-5,
    np.float32: 1e-25,  # <<<<=========== THE VALUE I CHANGED ===========
    np.float64: 1e-100,
}

def inference(self, chunk, collect_sstats=False):
    try:
        len(chunk)
    except TypeError:
        # convert iterators/generators to plain list, so we have len() etc.
        chunk = list(chunk)
    if len(chunk) > 1:
        logger.debug("performing inference on a chunk of %i documents", len(chunk))

    # Initialize the variational distribution q(theta|gamma) for the chunk
    gamma = self.random_state.gamma(100., 1. / 100., (len(chunk), self.num_topics)).astype(self.dtype, copy=False)
    Elogtheta = dirichlet_expectation(gamma)
    expElogtheta = np.exp(Elogtheta)

    assert Elogtheta.dtype == self.dtype
    assert expElogtheta.dtype == self.dtype

    if collect_sstats:
        sstats = np.zeros_like(self.expElogbeta, dtype=self.dtype)
    else:
        sstats = None
    converged = 0

    for d, doc in enumerate(chunk):
        if len(doc) > 0 and not isinstance(doc[0][0], six.integer_types + (np.integer,)):
            # make sure the term IDs are ints, otherwise np will get upset
            ids = [int(idx) for idx, _ in doc]
        else:
            ids = [idx for idx, _ in doc]
        cts = np.array([cnt for _, cnt in doc], dtype=self.dtype)
        gammad = gamma[d, :]
        Elogthetad = Elogtheta[d, :]
        expElogthetad = expElogtheta[d, :]
        expElogbetad = self.expElogbeta[:, ids]

        # The optimal phi_{dwk} is proportional to expElogthetad_k * expElogbetad_w.
        # phinorm is the normalizer.
        # TODO treat zeros explicitly, instead of adding epsilon?
        eps = DTYPE_TO_EPS[self.dtype]
        phinorm = np.dot(expElogthetad, expElogbetad) + eps

        # Iterate between gamma and phi until convergence
        for _ in xrange(self.iterations):
            lastgamma = gammad
            gammad = self.alpha + expElogthetad * np.dot(cts / phinorm, expElogbetad.T)
            Elogthetad = dirichlet_expectation(gammad)
            expElogthetad = np.exp(Elogthetad)
            phinorm = np.dot(expElogthetad, expElogbetad) + eps
            # If gamma hasn't changed much, we're done.
            meanchange = mean_absolute_difference(gammad, lastgamma)
            if meanchange < self.gamma_threshold:
                converged += 1
                break
        gamma[d, :] = gammad
        assert gammad.dtype == self.dtype
        if collect_sstats:
            # Contribution of document d to the expected sufficient
            # statistics for the M step.
            sstats[:, ids] += np.outer(expElogthetad.T, cts / phinorm)

    if len(chunk) > 1:
        logger.debug("%i/%i documents converged within %i iterations", converged, len(chunk), self.iterations)

    if collect_sstats:
        # This step finishes computing the sufficient statistics for the
        # M step, so that
        # sstats[k, w] = \sum_d n_{dw} * phi_{dwk}
        # = \sum_d n_{dw} * exp{Elogtheta_{dk} + Elogbeta_{kw}} / phinorm_{dw}.
        sstats *= self.expElogbeta
    assert sstats.dtype == self.dtype
    assert gamma.dtype == self.dtype
    return gamma, sstats

def modify_lda_inference():
    LdaModel.inference = inference
```

Usage:
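Why the larger eps helps can be seen in a float32 toy calculation. This sketch assumes the stock float32 eps in the affected versions was far smaller (e.g. 1e-35, which is the value the patch above replaces); when the phinorm dot product underflows to zero, a term count divided by a vanishing eps lands near the float32 ceiling, the next multiplication overflows to inf, and inf * 0 later yields the NaN seen in the topics:

```python
import numpy as np

with np.errstate(over="ignore", invalid="ignore"):
    tiny_eps = np.float32(1e-35)          # assumed stock float32 eps; the patch uses 1e-25
    phinorm = np.float32(0.0) + tiny_eps  # the dot product underflowed to 0.0
    scaled = np.float32(50.0) / phinorm   # ~5e36, close to float32 max (~3.4e38)
    print(scaled)
    blown = scaled * np.float32(1000.0)   # overflows to inf
    print(blown)
    print(blown * np.float32(0.0))        # inf * 0 -> nan
```

With eps = 1e-25 the same division yields ~5e26, leaving plenty of headroom before overflow, which is consistent with the fix making the NaNs disappear.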
Nice catch, thanks @zkwhandan 👍
I also met this problem when I ran LDA from the gensim library. Here is the error:
After going through the answers above, I tried updating my numpy and gensim versions to the latest ones. However, the problem is still there. My dataset includes about 10,000 tweets. Btw, when I tried with only 5 tweets, there seemed to be no problem generating the topics. Hope to get a response soon. Thank you!
I'm encountering the same issue when using a large number of topics (200+).
Description
Here is a brief introduction on StackOverflow; I thought I'd post this here too, because the other StackOverflow question with the exact same issue as mine hasn't gotten a single response in 2 weeks.
Link: https://stackoverflow.com/questions/51142294/gensim-ldamodel-error-nan-and-all-topics-the-same
Steps/Code/Corpus to Reproduce
Expected Results
Something like this:
Actual Results
Versions
I think it probably has to do with a numpy issue, but all my attempts to upgrade and reinstall have been fruitless. A coworker ran this on his computer and it worked just fine. A recent update to numpy has probably caused this (one other person posted the same issue on StackOverflow 2 weeks ago), but uninstalling packages has broken so many things that I don't want to take the risk. However, I am trying to learn how to use virtual environments so I can test different versions of numpy with this code. Thank you! Hope to get a response soon.