gensim.models.LDAmodel producing NaN & same words in each topic #2115

Closed
czhao028 opened this issue Jul 3, 2018 · 23 comments
Labels
bug Issue described a bug difficulty easy Easy issue: required small fix

Comments

@czhao028

czhao028 commented Jul 3, 2018

Description

Here is a brief introduction on StackOverflow; I thought I'd post this here too because the other StackOverflow question with the exact same issue as mine hasn't gotten even a single response in 2 weeks.

Link: https://stackoverflow.com/questions/51142294/gensim-ldamodel-error-nan-and-all-topics-the-same

Steps/Code/Corpus to Reproduce

#create pandas frame object w/ default rows
def tokenize(pd_object):
    for i, row in pd_object.iterrows():
        id = row["ID"]
        sentences = split_sentences(str(row["Comment"]))
        """ **Time Consuming** """
        tokens =  [[id, sent, gensim.parsing.preprocessing.preprocess_string(sent.lower(), filters=[strip_punctuation,
            strip_multiple_whitespaces, strip_numeric, strip_short, wordnet_stem])] for sent in sentences]
#append tokens to new pandas dataframe object 
def train(pd_object):
    t1 = time.time()
    phrases_and_tokens = tokenize(pd_object)
    bag_of_words = phrases_and_tokens["Tokens"].tolist()
    t2 = time.time()
    print("Time Taken %12f" % (t2-t1))

    bigram = gensim.models.Phrases(bag_of_words, threshold=1)
    bigram_mod = gensim.models.phrases.Phraser(bigram)

    texts = [filter_stop(bigram_mod[t]) for t in bag_of_words]

    id2word = corpora.Dictionary(texts)
    sent_wordfreq = [id2word.doc2bow(sent) for sent in texts]

    lda_model = gensim.models.ldamodel.LdaModel(corpus=sent_wordfreq,
                                                id2word=id2word,
                                                num_topics=5)

    print(lda_model.print_topics())


Expected Results

Something like this:

[(0,
  '0.025*"game" + 0.018*"team" + 0.016*"year" + 0.014*"play" + 0.013*"good" + '
  '0.012*"player" + 0.011*"win" + 0.007*"season" + 0.007*"hockey" + '
  '0.007*"fan"'),
 (1,
  '0.021*"window" + 0.015*"file" + 0.012*"image" + 0.010*"program" + '
  '0.010*"version" + 0.009*"display" + 0.009*"server" + 0.009*"software" + '
  '0.008*"graphic" + 0.008*"application"'),
 (2,
  '0.021*"gun" + 0.019*"state" + 0.016*"law" + 0.010*"people" + 0.008*"case" + '
  '0.008*"crime" + 0.007*"government" + 0.007*"weapon" + 0.007*"police" + '
  '0.006*"firearm"'),
 (3,
  '0.855*"ax" + 0.062*"max" + 0.002*"tm" + 0.002*"qax" + 0.001*"mf" + '
  '0.001*"giz" + 0.001*"_" + 0.001*"ml" + 0.001*"fp" + 0.001*"mr"'),
 (4,
  '0.020*"file" + 0.020*"line" + 0.013*"read" + 0.013*"set" + 0.012*"program" '
  '+ 0.012*"number" + 0.010*"follow" + 0.010*"error" + 0.010*"change" + '
  '0.009*"entry"'),
 (5,
  '0.021*"god" + 0.016*"christian" + 0.008*"religion" + 0.008*"bible" + '
  '0.007*"life" + 0.007*"people" + 0.007*"church" + 0.007*"word" + 0.007*"man" '
  '+ 0.006*"faith"'),
 (..truncated..)]

Actual Results

[(0, 'nan*"datalabs" + nan*"india" + nan*"frequently" + nan*"inconsistency" + nan*"standard" + ....
(1, 'nan*"datalabs" + nan*"india" + nan*"frequently" + nan*"inconsistency" + nan*"standard" + ...
(2, 'nan*"datalabs" + nan*"india" + nan*"frequently" + nan*"inconsistency" + nan*"standard" + ...
(3, 'nan*"datalabs" + nan*"india" + nan*"frequently" + nan*"inconsistency" + nan*"standard" + ..)
(4, 'nan*"datalabs" + nan*"india" + nan*"frequently" + nan*"inconsistency" + nan*"standard" + ..)]
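
A quick way to confirm that a model has degenerated like this, without reading the printed topics, is to test the topic-word matrix for non-finite values. The sketch below works on a plain NumPy array. `find_bad_topics` is a hypothetical helper, and the hand-made matrix stands in for what `lda_model.get_topics()` would return (assuming a gensim version that provides that method).

```python
import numpy as np

def find_bad_topics(topic_word):
    """Return indices of topics (rows) containing NaN or infinite weights."""
    topic_word = np.asarray(topic_word, dtype=float)
    bad_rows = ~np.isfinite(topic_word).all(axis=1)
    return np.flatnonzero(bad_rows).tolist()

# Hypothetical 3-topic, 4-word matrix; topic 1 has collapsed to NaN.
example = np.array([
    [0.4, 0.3, 0.2, 0.1],
    [np.nan, np.nan, np.nan, np.nan],
    [0.25, 0.25, 0.25, 0.25],
])
bad = find_bad_topics(example)
print("topics with non-finite weights:", bad)
```

An empty list would mean every topic row is finite.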


Versions

>>> import platform; print(platform.platform())
Darwin-17.6.0-x86_64-i386-64bit
>>> import sys; print("Python", sys.version)
Python 3.6.5 |Anaconda, Inc.| (default, Apr 26 2018, 08:42:37) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.14.5
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.1.0
>>> import gensim; print("gensim", gensim.__version__)
gensim 3.4.0
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 1

I think it probably has to do with a numpy issue but all my attempts to upgrade and reinstall have been fruitless. Another coworker ran this on his computer and it worked just fine. Probably a recent update in numpy has caused this recent issue (there's one other person who posted it on StackOverflow 2 weeks ago) but uninstalling packages has broken so many things that I don't want to take the risk. However, I am trying to learn how to use virtual environments and see if I can test out different versions of numpy with this code. Thank you! Hope to get a response soon.

@groceryheist

I also encountered this issue. I believe it is a bug introduced in a recent version of Gensim. Downgrading to gensim 3.1.0 solved the problem for me.

@piskvorky piskvorky added the bug Issue described a bug label Jul 8, 2018
@RanAR90

RanAR90 commented Jul 20, 2018

Hello all

I have also encountered the same problem; I am using gensim 3.5.0.
I trained a couple of other models earlier and they were all fine. I only got this when I tried to train on a corpus of 100K English Wikipedia articles.
I got these warnings during training:
RuntimeWarning: divide by zero encountered in log diff = np.log(self.expElogbeta)
RuntimeWarning: overflow encountered in add sstats[:, ids] += np.outer(expElogthetad.T, cts / phinorm)

I was using 30 passes in the training, but when I changed the number of passes to 1, I got normal topics, with only the divide-by-zero warning.
Just thought sharing this might help.

Best Regards
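
One way to find where the first bad value appears, rather than watching warnings like these scroll by, is to promote NumPy's floating-point warnings to exceptions. This is a minimal sketch using `np.errstate`; `checked_log` is a hypothetical helper, not part of gensim, and the input array is made up.

```python
import numpy as np

def checked_log(values):
    """Compute log(values), raising FloatingPointError instead of warning."""
    with np.errstate(divide="raise", over="raise", invalid="raise"):
        return np.log(values)

try:
    checked_log(np.array([0.5, 0.0], dtype=np.float32))
    raised = False
except FloatingPointError as err:
    # The zero entry triggers the same condition as the warning above.
    raised = True
    message = str(err)
    print("caught:", message)
```

Wrapping a training call in the same `np.errstate` context should turn the first overflow or division by zero into a traceback, at least when training runs in a single process.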

@menshikh-iv
Contributor

Thanks for the report @czhao028,

Can you please share a small reproducible example? I can't run your code as posted, because it is incomplete and the data is missing.

@menshikh-iv menshikh-iv added the need info Not enough information for reproduce an issue, need more info from author label Jul 31, 2018
@czhao028
Author

czhao028 commented Jul 31, 2018

I can't give you a sample of my data because it's confidential data, but phrases_and_tokens["Phrases"] is a pd.Series object containing rows of keywords created by this portion of the code:

[gensim.parsing.preprocessing.preprocess_string(sent.lower(), filters=[strip_punctuation, strip_multiple_whitespaces, strip_numeric, strip_short, wordnet_stem]) for sent in sentences]

After reviewing the tokenize method, I realized it's outdated, so I've included the most recent version below:

[Screenshot: screen shot 2018-07-31 at 10 19 45 am]

where token_helper is essentially the first line of code I mentioned earlier in this comment

@menshikh-iv
Contributor

@czhao028 that's really sad, because if we can't reproduce an issue, we have no chance of fixing it. Can you try to reproduce this error with a publicly available dataset, please?

@czhao028
Author

czhao028 commented Aug 3, 2018

@RaniemAR you wanna jump in here?

@menshikh-iv
Contributor

@czhao028 ping

@snollygoster123123

I have the same problem. As soon as I try 70 or more topics, I only get NaN. I tried it on two different computers and with many combinations of gensim and numpy versions. Nothing helped.

@menshikh-iv
Contributor

Hi @snollygoster123123, can you give more information (exact code with dataset, Python/OS/gensim versions)? We need to reproduce this issue first.

@snollygoster123123

snollygoster123123 commented Aug 15, 2018

Hi @menshikh-iv, I solved the problem by switching to the single-core LDA model. My dataset consists of 60k documents (each approximately as long as a Wikipedia article).
It worked fine with LDA Multicore for 10-60 topics. Anything above that results in only NaNs.
The line for the LDA was:

lda = gensim.models.ldamulticore.LdaMulticore(corpus, 
    id2word=dictionary, num_topics=80, chunksize=1800, passes=20, 
    workers=1, eval_every=1, iterations=1000)

I think my post doesn't belong in this issue, because OP is using the single-core model. If you want to, you can delete my post or move it.

@menshikh-iv
Contributor

@snollygoster123123 can you share the corpus, please?

@snollygoster123123

snollygoster123123 commented Aug 17, 2018

@menshikh-iv The corpus is 170MB. The only way I had to upload it was uploaded.net:
http://uploaded.net/file/3bzy5v6p
If you have a better way, please let me know.
I now have the same problem with the single-core version as well (whenever I use 80 or more topics). If you are able to generate 80 topics, please let me know.

@piskvorky
Owner

piskvorky commented Aug 18, 2018

@menshikh-iv if downgrading to Gensim 3.1.0 helps like @groceryheist says, it must be an issue with the recent additions. IIRC, there was some PR that reimplemented parts of LDA in C/Cython, right?

Maybe it used the wrong precision (floats, single precision)? If the error is due to such numeric issues, I'm thinking it's possible it only manifests itself on larger datasets, and so our unit tests didn't catch it.
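
To illustrate the kind of precision problem suggested here: a quantity that is tiny but representable in double precision can flush to exactly zero in single precision, and a later log then produces -inf. The value below is made up for illustration and is not taken from gensim's internals.

```python
import numpy as np

x = -110.0  # a very negative log-scale value (illustrative only)

small64 = np.exp(np.float64(x))   # roughly 1.7e-48, still positive
small32 = np.exp(np.float32(x))   # below float32's smallest subnormal, so 0.0

with np.errstate(divide="ignore"):
    log32 = np.log(small32)       # log(0) gives -inf

print(small64, small32, log32)
```

The same input survives a round trip through exp and log in float64 but not in float32, which is consistent with a bug that only shows up on larger or more extreme inputs.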

@csmyth76

csmyth76 commented Sep 16, 2018

If you want to reproduce the error: I get it when I run the code here:
https://datascienceplus.com/topic-modeling-in-python-with-nltk-and-gensim/

...and remove:
if random.random() > .99:

There's a link on that page to the github that has the code and corpus.

@menshikh-iv
Contributor

@piskvorky I think you are right, this is definitely a numeric issue. It may also be related to #1927.

@piskvorky
Owner

piskvorky commented Sep 17, 2018

@csmyth76 @snollygoster123123 @RaniemAR @czhao028 any appetite for looking into this and fixing the numerical bug?

I don't know when we'll get to this ourselves, so help would be welcome. It looks like a serious issue with a potentially simple fix.

@johann-petrak
Contributor

I just got the same problem, out of the blue, running gensim version 3.4.0.

anaconda3/lib/python3.6/site-packages/gensim/models/ldamodel.py:775: RuntimeWarning: divide by zero encountered in log
diff = np.log(self.expElogbeta)

I have run the same task without problems with slightly different versions of the corpus.
So it seems there is some very specific situation here which cannot easily be reproduced by a minimal test case. Sadly, I also cannot share the corpus, as it is licensed.

I do not really understand the code well enough to be of much help, but would a simple guard against taking log(0), setting diff to 0 in that case, be a workaround here?

Apparently expElogbeta is set to np.exp(self.state.get_Elogbeta()) which means that the result of self.state.get_Elogbeta() must be -Inf which in turn means that dirichlet_expectation(self.get_lambda()) must be -Inf which means that self.get_lambda() must be zero? Not sure how that could ever happen or if my train of thought is wrong here ...

@johann-petrak
Contributor

This closed issue appears to be about the same problem and may contain relevant information: #217

@johann-petrak
Contributor

OK, I checked and in my case there are many values in self.state.sstats which are zero.
Then self.expElogbeta = np.exp(dirichlet_expectation(self.state.sstats)) and diff = np.log(self.expElogbeta) and then taking the mean of anything that has at least one Inf value in it causes the topic diff to be Inf.

Now, I do not know exactly what the implications should be if some sstats are zero, but I think they should definitely not have an influence on the topic diff like this, but maybe also not on other locations where we get +/- Inf or NaN because of those zeroes? The code appears to alternate between calculating the exp and log (which is -Inf for 0) quite frequently, and the digamma function for the dirichlet expectation (which is Inf for 0). Maybe there is a strategy to correctly handle the calculation of these functions for values which are ultimately coming from those zero counts?
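
The propagation described above is easy to reproduce in isolation. The sketch below uses a made-up matrix with one zero entry, shows that a single -inf makes the mean infinite, and then applies one possible guard (masking non-finite entries). Whether such a guard is statistically appropriate inside LDA is an open question here; this only illustrates the arithmetic.

```python
import numpy as np

# Made-up stand-in for expElogbeta with a single zero entry.
exp_elog_beta = np.array([[0.2, 0.0],
                          [0.5, 0.3]])

with np.errstate(divide="ignore"):
    log_vals = np.log(exp_elog_beta)      # contains one -inf

naive_diff = np.mean(np.abs(log_vals))    # one infinite entry makes this inf

# One possible guard: average over the finite entries only.
finite = np.isfinite(log_vals)
guarded_diff = np.mean(np.abs(log_vals[finite]))

print(naive_diff, guarded_diff)
```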

@zkwhandan

zkwhandan commented Nov 30, 2018

I found a solution to this problem.
At line 666 in ldamodel.py, there is a TODO:
# TODO treat zeros explicitly, instead of adding epsilon?
eps = DTYPE_TO_EPS[self.dtype]
phinorm = np.dot(expElogthetad, expElogbetad) + eps

This eps is too small. When I increase it, the NaNs disappear.

Create a file (lda_model_modify.py, the module name imported in the usage below):

from gensim.models.ldamodel import *


DTYPE_TO_EPS = {
    np.float16: 1e-5,
    np.float32: 1e-25,      # <<<<=========== THE VALUE I CHANGE ===========
    np.float64: 1e-100,
}


def inference(self, chunk, collect_sstats=False):
    try:
        len(chunk)
    except TypeError:
        # convert iterators/generators to plain list, so we have len() etc.
        chunk = list(chunk)
    if len(chunk) > 1:
        logger.debug("performing inference on a chunk of %i documents", len(chunk))

    # Initialize the variational distribution q(theta|gamma) for the chunk
    gamma = self.random_state.gamma(100., 1. / 100., (len(chunk), self.num_topics)).astype(self.dtype, copy=False)
    Elogtheta = dirichlet_expectation(gamma)
    expElogtheta = np.exp(Elogtheta)

    assert Elogtheta.dtype == self.dtype
    assert expElogtheta.dtype == self.dtype

    if collect_sstats:
        sstats = np.zeros_like(self.expElogbeta, dtype=self.dtype)
    else:
        sstats = None
    converged = 0

    for d, doc in enumerate(chunk):
        if len(doc) > 0 and not isinstance(doc[0][0], six.integer_types + (np.integer,)):
            # make sure the term IDs are ints, otherwise np will get upset
            ids = [int(idx) for idx, _ in doc]
        else:
            ids = [idx for idx, _ in doc]
        cts = np.array([cnt for _, cnt in doc], dtype=self.dtype)
        gammad = gamma[d, :]
        Elogthetad = Elogtheta[d, :]
        expElogthetad = expElogtheta[d, :]
        expElogbetad = self.expElogbeta[:, ids]

        # The optimal phi_{dwk} is proportional to expElogthetad_k * expElogbetad_w.
        # phinorm is the normalizer.
        # TODO treat zeros explicitly, instead of adding epsilon?
        eps = DTYPE_TO_EPS[self.dtype]
        phinorm = np.dot(expElogthetad, expElogbetad) + eps

        # Iterate between gamma and phi until convergence
        for _ in xrange(self.iterations):
            lastgamma = gammad
            gammad = self.alpha + expElogthetad * np.dot(cts / phinorm, expElogbetad.T)
            Elogthetad = dirichlet_expectation(gammad)
            expElogthetad = np.exp(Elogthetad)
            phinorm = np.dot(expElogthetad, expElogbetad) + eps
            # If gamma hasn't changed much, we're done.
            meanchange = mean_absolute_difference(gammad, lastgamma)
            if meanchange < self.gamma_threshold:
                converged += 1
                break
        gamma[d, :] = gammad
        assert gammad.dtype == self.dtype
        if collect_sstats:
            # Contribution of document d to the expected sufficient
            # statistics for the M step.
            sstats[:, ids] += np.outer(expElogthetad.T, cts / phinorm)

    if len(chunk) > 1:
        logger.debug("%i/%i documents converged within %i iterations", converged, len(chunk), self.iterations)

    if collect_sstats:
        # This step finishes computing the sufficient statistics for the
        # M step, so that
        # sstats[k, w] = \sum_d n_{dw} * phi_{dwk}
        # = \sum_d n_{dw} * exp{Elogtheta_{dk} + Elogbeta_{kw}} / phinorm_{dw}.
        sstats *= self.expElogbeta
        assert sstats.dtype == self.dtype

    assert gamma.dtype == self.dtype
    return gamma, sstats


def modify_lda_inference():
    LdaModel.inference = inference

Usage:

from lda_model_modify import modify_lda_inference
modify_lda_inference()
from gensim.models import LdaMulticore
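
A possible reading of why the larger epsilon helps, sketched with made-up numbers: when the dot product underflows to zero, phinorm equals eps, so cts / phinorm is enormous, and accumulating many such terms in float32 can exceed its maximum (about 3.4e38). The helper `accumulate` and the document counts are hypothetical, and 1e-35 is assumed to be the float32 value before the change (the thread does not state it). This mimics the arithmetic only, not gensim's actual code path.

```python
import numpy as np

def accumulate(eps, n_docs=2000, count=5.0):
    """Sum count / phinorm over n_docs in float32, with phinorm equal to eps."""
    # Assumption: the dot product has underflowed to exactly 0.
    phinorm = np.float32(0.0) + np.float32(eps)
    total = np.float32(0.0)
    with np.errstate(over="ignore"):
        for _ in range(n_docs):
            total = total + np.float32(count) / phinorm
    return total

with_small_eps = accumulate(1e-35)   # 2000 * 5e35 exceeds float32 max, so inf
with_larger_eps = accumulate(1e-25)  # 2000 * 5e25 = 1e29, still finite

with np.errstate(invalid="ignore"):
    # Multiplying an infinite statistic by a zero weight yields NaN.
    poisoned = with_small_eps * np.float32(0.0)

print(with_small_eps, with_larger_eps, poisoned)
```

Under these assumptions the final scaling step turns the infinity into NaN, which would match the all-NaN topics reported at the top of the thread.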

@menshikh-iv menshikh-iv removed the need info Not enough information for reproduce an issue, need more info from author label Dec 13, 2018
@menshikh-iv
Contributor

Nice catch, thanks @zkwhandan 👍

menshikh-iv pushed a commit that referenced this issue Jan 9, 2019
* Fix #2115: Replace custom epsilons with automatic numpy equivalent

* fix typo
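
For reference, NumPy exposes a per-dtype machine epsilon through `np.finfo`, which is presumably the "automatic numpy equivalent" the commit message refers to (an assumption; the commit itself is not shown in this thread). These values are far larger than the hand-picked constants quoted earlier:

```python
import numpy as np

# Machine epsilon for each float type: the gap between 1.0 and the next
# representable value.
machine_eps = {
    dtype.__name__: float(np.finfo(dtype).eps)
    for dtype in (np.float16, np.float32, np.float64)
}
for name, eps in machine_eps.items():
    print(name, eps)
```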
@Yukisu03

Yukisu03 commented Jul 26, 2019

I also hit this problem when I ran LDA from the gensim library. Here is the warning:

/anaconda3/lib/python3.6/site-packages/gensim/models/ldamodel.py:678: RuntimeWarning: overflow encountered in exp expElogthetad = np.exp(Elogthetad).

After going through the answers above, I updated NumPy and gensim to the latest versions. However, the problem is still there. My dataset includes about 10,000 tweets. By the way, when I tried with just 5 tweets, the topics were generated without a problem.

Hope to get a response soon. Thank you!
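
As a point of reference for this warning: `np.exp` overflows in float32 once its argument exceeds roughly 88.7, which suggests that some upstream value is already far out of range by the time the warning fires. A small sketch with illustrative inputs (not values from gensim):

```python
import numpy as np

# Largest argument for which exp() still fits in float32.
limit32 = float(np.log(np.finfo(np.float32).max))

with np.errstate(over="ignore"):
    below = np.exp(np.float32(88.0))   # finite, about 1.65e38
    above = np.exp(np.float32(89.0))   # exceeds float32 max, becomes inf

print(limit32, below, above)
```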

@notAmine

I'm encountering the same issue when using a large number of topics (200+).
