gensim.models.LDAmodel producing NaN & same words in each topic #2115

czhao028 · 2018-07-03T16:01:49Z

Description

Here is a brief introduction on StackOverflow; I thought I'd post this here too because the other StackOverflow question with the exact same issue as mine hasn't gotten even a single response in 2 weeks.

Link: https://stackoverflow.com/questions/51142294/gensim-ldamodel-error-nan-and-all-topics-the-same

Steps/Code/Corpus to Reproduce

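import time

import gensim
import pandas as pd
from gensim import corpora
from gensim.parsing.preprocessing import (
    strip_multiple_whitespaces, strip_numeric, strip_punctuation, strip_short)

# split_sentences, wordnet_stem and filter_stop are defined elsewhere (not shown).

def tokenize(pd_object):
    """Return a DataFrame with one row per sentence: ID, sentence, tokens."""
    filters = [strip_punctuation, strip_multiple_whitespaces,
               strip_numeric, strip_short, wordnet_stem]
    rows = []
    for i, row in pd_object.iterrows():
        doc_id = row["ID"]  # renamed from `id` to avoid shadowing the builtin
        sentences = split_sentences(str(row["Comment"]))
        # Time-consuming step
        rows.extend(
            [doc_id, sent,
             gensim.parsing.preprocessing.preprocess_string(sent.lower(), filters=filters)]
            for sent in sentences)
    # Only "Tokens" is used below; the other two column names are placeholders.
    return pd.DataFrame(rows, columns=["ID", "Sentence", "Tokens"])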

def train(pd_object):
    t1 = time.time()
    phrases_and_tokens = tokenize(pd_object)
    bag_of_words = phrases_and_tokens["Tokens"].tolist()
    t2 = time.time()
    print("Time Taken %12f" % (t2-t1))

    bigram = gensim.models.Phrases(bag_of_words, threshold=1)
    bigram_mod = gensim.models.phrases.Phraser(bigram)

    texts = [filter_stop(bigram_mod[t]) for t in bag_of_words]

    id2word = corpora.Dictionary(texts)
    sent_wordfreq = [id2word.doc2bow(sent) for sent in texts]

    lda_model = gensim.models.ldamodel.LdaModel(corpus=sent_wordfreq,
                                                id2word=id2word,
                                                num_topics=5)

    print(lda_model.print_topics())


Expected Results

Something like this:

[(0,
  '0.025*"game" + 0.018*"team" + 0.016*"year" + 0.014*"play" + 0.013*"good" + '
  '0.012*"player" + 0.011*"win" + 0.007*"season" + 0.007*"hockey" + '
  '0.007*"fan"'),
 (1,
  '0.021*"window" + 0.015*"file" + 0.012*"image" + 0.010*"program" + '
  '0.010*"version" + 0.009*"display" + 0.009*"server" + 0.009*"software" + '
  '0.008*"graphic" + 0.008*"application"'),
 (2,
  '0.021*"gun" + 0.019*"state" + 0.016*"law" + 0.010*"people" + 0.008*"case" + '
  '0.008*"crime" + 0.007*"government" + 0.007*"weapon" + 0.007*"police" + '
  '0.006*"firearm"'),
 (3,
  '0.855*"ax" + 0.062*"max" + 0.002*"tm" + 0.002*"qax" + 0.001*"mf" + '
  '0.001*"giz" + 0.001*"_" + 0.001*"ml" + 0.001*"fp" + 0.001*"mr"'),
 (4,
  '0.020*"file" + 0.020*"line" + 0.013*"read" + 0.013*"set" + 0.012*"program" '
  '+ 0.012*"number" + 0.010*"follow" + 0.010*"error" + 0.010*"change" + '
  '0.009*"entry"'),
 (5,
  '0.021*"god" + 0.016*"christian" + 0.008*"religion" + 0.008*"bible" + '
  '0.007*"life" + 0.007*"people" + 0.007*"church" + 0.007*"word" + 0.007*"man" '
  '+ 0.006*"faith"'),
 (..truncated..)]

Actual Results

[(0, 'nan*"datalabs" + nan*"india" + nan*"frequently" + nan*"inconsistency" + nan*"standard" + ....
(1, 'nan*"datalabs" + nan*"india" + nan*"frequently" + nan*"inconsistency" + nan*"standard" + ...
(2, 'nan*"datalabs" + nan*"india" + nan*"frequently" + nan*"inconsistency" + nan*"standard" + ...
(3, 'nan*"datalabs" + nan*"india" + nan*"frequently" + nan*"inconsistency" + nan*"standard" + ..)
(4, 'nan*"datalabs" + nan*"india" + nan*"frequently" + nan*"inconsistency" + nan*"standard" + ..)]


Versions

>>> import platform; print(platform.platform())
Darwin-17.6.0-x86_64-i386-64bit
>>> import sys; print("Python", sys.version)
Python 3.6.5 |Anaconda, Inc.| (default, Apr 26 2018, 08:42:37) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.14.5
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.1.0
>>> import gensim; print("gensim", gensim.__version__)
gensim 3.4.0
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 1

I suspect a NumPy issue, but all my attempts to upgrade and reinstall have been fruitless. A coworker ran this on his computer and it worked just fine. A recent NumPy update has probably caused this (one other person posted the same problem on StackOverflow 2 weeks ago), but uninstalling packages has broken so many things that I don't want to take the risk. However, I am trying to learn how to use virtual environments so I can test different versions of NumPy with this code. Thank you! Hope to get a response soon.

groceryheist · 2018-07-07T22:42:16Z

I also encountered this issue. I believe it is a bug introduced in a recent version of Gensim. Downgrading to gensim 3.1.0 solved the problem for me.

RanAR90 · 2018-07-20T08:40:01Z

Hello all

I have also run into the same problem; I am using gensim 3.5.0.
I trained a couple of other models earlier and they were all fine. I only got this when training on a corpus of 100K English Wikipedia articles.
I got these warnings during training:
RuntimeWarning: divide by zero encountered in log
  diff = np.log(self.expElogbeta)
RuntimeWarning: overflow encountered in add
  sstats[:, ids] += np.outer(expElogthetad.T, cts / phinorm)

I was training with 30 passes, but when I changed passes to 1, I got normal topics, with only the divide-by-zero warning.
Just thought sharing this might help.
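One way to find where the first bad value appears is to promote NumPy's floating-point warnings to exceptions, so that the run stops at the offending line instead of carrying on with NaNs. The following is a minimal sketch using only NumPy; the array is a made-up stand-in for expElogbeta, not data from this issue, and first_fp_error is a hypothetical helper name.

```python
import numpy as np

# Hypothetical stand-in for a topic-word matrix containing an exact zero.
exp_elog_beta = np.array([[0.5, 0.0], [0.25, 0.25]], dtype=np.float32)


def first_fp_error(arr):
    """Return the FloatingPointError message raised by np.log(arr), or None."""
    # Inside this block, divide-by-zero, overflow and invalid operations raise
    # FloatingPointError instead of emitting a RuntimeWarning.
    with np.errstate(divide="raise", over="raise", invalid="raise"):
        try:
            np.log(arr)
        except FloatingPointError as exc:
            return str(exc)
    return None


message = first_fp_error(exp_elog_beta)
print(message)
```

Wrapping a single-process training call in the same np.errstate block should, in principle, give a traceback at the first overflow or division by zero. The setting is per thread, so it is unlikely to reach worker processes started by a multicore model.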

Best Regards

menshikh-iv · 2018-07-31T05:25:02Z

Thanks for the report @czhao028,

can you please share a small reproducible example? Right now I can't run your code because it is incomplete and the data is missing.

czhao028 · 2018-07-31T14:16:42Z

I can't give you a sample of my data because it's confidential data, but phrases_and_tokens["Phrases"] is a pd.Series object containing rows of keywords created by this portion of the code:

[gensim.parsing.preprocessing.preprocess_string(sent.lower(), filters=[strip_punctuation, strip_multiple_whitespaces, strip_numeric, strip_short, wordnet_stem]) for sent in sentences]

after reviewing the tokenize method, it's outdated so I've included the most recent version below:

where token_helper is essentially the first line of code I mentioned earlier in this comment

menshikh-iv · 2018-08-01T05:11:48Z

@czhao028 that's really sad, because if we can't reproduce an issue, we have no chance of fixing it. Can you try to reproduce this error with a publicly available dataset, please?

czhao028 · 2018-08-03T17:51:50Z

@RaniemAR you wanna jump in here?

menshikh-iv · 2018-08-07T15:33:31Z

@czhao028 ping

snollygoster123123 · 2018-08-14T21:25:03Z

I do have the same problem. For me as soon as I try 70 or more topics, I only get NaN. I tried it on two different computers and I tried a lot of combinations between gensim versions and numpy version. Nothing helped.

menshikh-iv · 2018-08-15T02:05:06Z

Hi @snollygoster123123, can you give more information (exact code with dataset, python/os/gensim version), we need to reproduce this issue first.

snollygoster123123 · 2018-08-15T22:08:01Z

Hi @menshikh-iv, I solved the problem by switching to the single-core LDA model. My dataset consists of 60k documents (each approximately as long as a Wikipedia article).
LDA Multicore worked fine for 10-60 topics. Anything above that results in only NaNs.
The line for the lda was:

lda = gensim.models.ldamulticore.LdaMulticore(corpus, 
    id2word=dictionary, num_topics=80, chunksize=1800, passes=20, 
    workers=1, eval_every=1, iterations=1000)

I think my post doesn't belong in this issue, because OP is using the single-core model. If you want to, you can delete my post or move it.

menshikh-iv · 2018-08-16T01:40:55Z

@snollygoster123123 can you share corpus please?

snollygoster123123 · 2018-08-17T15:50:48Z

@menshikh-iv The corpus is 170MB. The only way I had to upload it was uploaded.net:
http://uploaded.net/file/3bzy5v6p
In case you have a better way, please let me know.
I now have the same problem with the single-core version too (whenever I use 80 or more topics). If you are able to generate 80 topics, please let me know.

piskvorky · 2018-08-18T20:56:16Z

@menshikh-iv if downgrading to Gensim 3.1.0 helps like @groceryheist says, it must be an issue with the recent additions. IIRC, there was some PR that reimplemented parts of LDA in C/Cython, right?

Maybe it used the wrong precision (floats, single precision)? If the error is due to such numeric issues, I'm thinking it's possible it only manifests itself on larger datasets, and so our unit tests didn't catch it.
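The precision hypothesis can be illustrated in isolation. The sketch below assumes nothing about Gensim's internals; it only shows that the exponential of a moderately large negative number (the value -110 is picked for illustration) underflows to exactly zero in single precision but not in double precision, and that a later log then yields -inf.

```python
import numpy as np

x = -110.0  # a large negative expected log-probability (illustrative value)

with np.errstate(under="ignore", divide="ignore"):
    as32 = np.exp(np.float32(x))   # underflows to exactly 0.0 in single precision
    as64 = np.exp(np.float64(x))   # still a tiny positive number in double precision
    log32 = np.log(as32)           # log(0) -> -inf
    log64 = np.log(as64)           # recovers x

print(as32, as64, log32, log64)
```

Whether values this small actually occur depends on the corpus and the number of topics, which would be consistent with the problem showing up only on larger datasets.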

csmyth76 · 2018-09-16T23:06:10Z

If you want to reproduce the error: I get it when I run the code here:
https://datascienceplus.com/topic-modeling-in-python-with-nltk-and-gensim/

...and remove:
if random.random() > .99:

There's a link on that page to the github that has the code and corpus.

menshikh-iv · 2018-09-17T02:40:11Z

@piskvorky I think you are right, this is definitely a numeric issue. It may also be related to #1927

piskvorky · 2018-09-17T05:59:02Z

@csmyth76 @snollygoster123123 @RaniemAR @czhao028 any appetite for looking into this and fixing the numerical bug?

I don't know when we'll get to this ourselves, so help would be welcome. It looks like a serious issue with a potentially simple fix.

johann-petrak · 2018-11-09T21:34:20Z

I just got the same problem, out of the blue, running gensim version 3.4.0.

anaconda3/lib/python3.6/site-packages/gensim/models/ldamodel.py:775: RuntimeWarning: divide by zero encountered in log
diff = np.log(self.expElogbeta)

I have run the same task without problems with slightly different versions of the corpus.
So it seems there is some very specific situation here which cannot easily be reproduced by a minimal test case. Sadly, I also cannot share the corpus as it is licensed.

I do not really understand the code enough to be of much help but would a simple
guard against trying to get log(0) and setting diff to 0 in that case be a workaround here?

Apparently expElogbeta is set to np.exp(self.state.get_Elogbeta()) which means that the result of self.state.get_Elogbeta() must be -Inf which in turn means that dirichlet_expectation(self.get_lambda()) must be -Inf which means that self.get_lambda() must be zero? Not sure how that could ever happen or if my train of thought is wrong here ...

johann-petrak · 2018-11-10T14:26:37Z

This closed issue appears to be about the same problem and may contain relevant information: #217

johann-petrak · 2018-11-12T08:18:11Z

OK, I checked and in my case there are many values in self.state.sstats which are zero.
Then self.expElogbeta = np.exp(dirichlet_expectation(self.state.sstats)) and diff = np.log(self.expElogbeta) and then taking the mean of anything that has at least one Inf value in it causes the topic diff to be Inf.

Now, I do not know exactly what the implications should be if some sstats are zero, but I think they should definitely not have an influence on the topic diff like this, but maybe also not on other locations where we get +/- Inf or NaN because of those zeroes? The code appears to alternate between calculating the exp and log (which is -Inf for 0) quite frequently, and the digamma function for the dirichlet expectation (which is Inf for 0). Maybe there is a strategy to correctly handle the calculation of these functions for values which are ultimately coming from those zero counts?
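The propagation described here, and the guard floated earlier in this thread, can be sketched with plain NumPy. The array below is a hypothetical expElogbeta row with one exact zero, and the guarded variant is only an illustration of "set diff to 0 where the log is undefined", not the fix that was eventually merged.

```python
import numpy as np

# Hypothetical expElogbeta row with one exact zero (e.g. after underflow).
exp_elog_beta = np.array([0.2, 0.0, 0.8])

with np.errstate(divide="ignore"):
    diff = np.log(exp_elog_beta)          # contains -inf at the zero entry
naive_mean = np.mean(np.abs(diff))        # inf: one bad entry poisons the mean

# Guarded variant: only take the log where the input is positive and
# leave the remaining entries at 0.
safe_diff = np.zeros_like(exp_elog_beta)
np.log(exp_elog_beta, out=safe_diff, where=exp_elog_beta > 0)
guarded_mean = np.mean(np.abs(safe_diff))

print(naive_mean, guarded_mean)
```

A guard like this keeps the reported topic diff finite, but it hides the zeros rather than explaining where they come from, so it is at best a workaround.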

zkwhandan · 2018-11-30T07:13:25Z

I found a solution to this problem.
At line 666 in ldamodel.py, there is a TODO.
# TODO treat zeros explicitly, instead of adding epsilon?
eps = DTYPE_TO_EPS[self.dtype]
phinorm = np.dot(expElogthetad, expElogbetad) + eps

This eps is too small. When I increase it, the NaNs disappear.

Create a file (lda_model_modify.py):

from gensim.models.ldamodel import *


DTYPE_TO_EPS = {
    np.float16: 1e-5,
    np.float32: 1e-25,      # <<<<=========== THE VALUE I CHANGE ===========
    np.float64: 1e-100,
}


def inference(self, chunk, collect_sstats=False):
    try:
        len(chunk)
    except TypeError:
        # convert iterators/generators to plain list, so we have len() etc.
        chunk = list(chunk)
    if len(chunk) > 1:
        logger.debug("performing inference on a chunk of %i documents", len(chunk))

    # Initialize the variational distribution q(theta|gamma) for the chunk
    gamma = self.random_state.gamma(100., 1. / 100., (len(chunk), self.num_topics)).astype(self.dtype, copy=False)
    Elogtheta = dirichlet_expectation(gamma)
    expElogtheta = np.exp(Elogtheta)

    assert Elogtheta.dtype == self.dtype
    assert expElogtheta.dtype == self.dtype

    if collect_sstats:
        sstats = np.zeros_like(self.expElogbeta, dtype=self.dtype)
    else:
        sstats = None
    converged = 0

    for d, doc in enumerate(chunk):
        if len(doc) > 0 and not isinstance(doc[0][0], six.integer_types + (np.integer,)):
            # make sure the term IDs are ints, otherwise np will get upset
            ids = [int(idx) for idx, _ in doc]
        else:
            ids = [idx for idx, _ in doc]
        cts = np.array([cnt for _, cnt in doc], dtype=self.dtype)
        gammad = gamma[d, :]
        Elogthetad = Elogtheta[d, :]
        expElogthetad = expElogtheta[d, :]
        expElogbetad = self.expElogbeta[:, ids]

        # The optimal phi_{dwk} is proportional to expElogthetad_k * expElogbetad_w.
        # phinorm is the normalizer.
        # TODO treat zeros explicitly, instead of adding epsilon?
        eps = DTYPE_TO_EPS[self.dtype]
        phinorm = np.dot(expElogthetad, expElogbetad) + eps

        # Iterate between gamma and phi until convergence
        for _ in xrange(self.iterations):
            lastgamma = gammad
            gammad = self.alpha + expElogthetad * np.dot(cts / phinorm, expElogbetad.T)
            Elogthetad = dirichlet_expectation(gammad)
            expElogthetad = np.exp(Elogthetad)
            phinorm = np.dot(expElogthetad, expElogbetad) + eps
            # If gamma hasn't changed much, we're done.
            meanchange = mean_absolute_difference(gammad, lastgamma)
            if meanchange < self.gamma_threshold:
                converged += 1
                break
        gamma[d, :] = gammad
        assert gammad.dtype == self.dtype
        if collect_sstats:
            # Contribution of document d to the expected sufficient
            # statistics for the M step.
            sstats[:, ids] += np.outer(expElogthetad.T, cts / phinorm)

    if len(chunk) > 1:
        logger.debug("%i/%i documents converged within %i iterations", converged, len(chunk), self.iterations)

    if collect_sstats:
        # This step finishes computing the sufficient statistics for the
        # M step, so that
        # sstats[k, w] = \sum_d n_{dw} * phi_{dwk}
        # = \sum_d n_{dw} * exp{Elogtheta_{dk} + Elogbeta_{kw}} / phinorm_{dw}.
        sstats *= self.expElogbeta
        assert sstats.dtype == self.dtype

    assert gamma.dtype == self.dtype
    return gamma, sstats


def modify_lda_inference():
    LdaModel.inference = inference

Usage:

from lda_model_modify import modify_lda_inference
modify_lda_inference()
from gensim.models import LdaMulticore
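Why the size of eps might matter can be sketched with scalar float32 arithmetic. This is a simplified model of the accumulation in inference(), not the real code path: it assumes one word whose expElogbeta has underflowed to zero, so phinorm collapses to eps, and it assumes 1e-35 as the smaller epsilon (the thread does not state the original value). The names accumulate, n_docs and count are hypothetical.

```python
import numpy as np


def accumulate(eps, n_docs=1000, count=10.0):
    """Mimic sstats accumulation for one word whose expElogbeta underflowed to 0."""
    eps = np.float32(eps)
    exp_elog_beta = np.float32(0.0)     # underflowed topic-word weight
    exp_elog_theta = np.float32(1.0)    # illustrative document-topic weight
    sstats = np.float32(0.0)
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(n_docs):
            phinorm = exp_elog_theta * exp_elog_beta + eps   # collapses to eps
            sstats = sstats + exp_elog_theta * np.float32(count) / phinorm
        # Final "sstats *= self.expElogbeta" step: inf * 0 gives NaN.
        return sstats * exp_elog_beta


tiny_eps_result = accumulate(1e-35)    # sum exceeds float32 max -> inf -> NaN
larger_eps_result = accumulate(1e-25)  # sum stays finite -> 0.0

print(tiny_eps_result, larger_eps_result)
```

Under these assumptions, the smaller epsilon lets the running sum exceed the float32 range, and the final multiplication by zero turns the resulting inf into NaN; the larger epsilon keeps every intermediate value finite.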

menshikh-iv · 2018-12-13T10:08:08Z

Nice catch, thanks @zkwhandan 👍
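The fix that closed this issue is titled "Replace custom epsilons with numpy equivalent in LdaModel" (#2308). Assuming the "numpy equivalent" means the machine epsilon from np.finfo, which this thread does not spell out, the comparison with the hand-tuned table quoted above looks like this:

```python
import numpy as np

# Hand-tuned values quoted earlier in this thread (float32 entry as modified there).
custom_eps = {np.float16: 1e-5, np.float32: 1e-25, np.float64: 1e-100}

machine_eps = {}
for dtype, custom in custom_eps.items():
    # Spacing between 1.0 and the next representable float of this dtype.
    machine_eps[dtype] = float(np.finfo(dtype).eps)
    print(dtype.__name__, custom, machine_eps[dtype])
```

In every case the machine epsilon is many orders of magnitude larger than the hand-tuned value, which would bound cts / phinorm much more tightly.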


Yukisu03 · 2019-07-26T03:44:02Z

I also ran into this problem when running LDA from the Gensim library. Here is the warning:

/anaconda3/lib/python3.6/site-packages/gensim/models/ldamodel.py:678: RuntimeWarning: overflow encountered in exp
  expElogthetad = np.exp(Elogthetad)

After going through the answers above, I updated NumPy and Gensim to the latest versions, but the problem persists. My dataset contains about 10,000 tweets. By the way, when I tried with only 5 tweets, the topics were generated without any problem.

Hope to get a response soon. Thank you!
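For reference, the point at which np.exp overflows depends on the precision in use. The sketch below only uses NumPy; whether the model in this report was running in float32 is an assumption.

```python
import numpy as np

# Largest argument for which exp() stays finite in each precision.
limit32 = float(np.log(np.finfo(np.float32).max))   # about 88.7
limit64 = float(np.log(np.finfo(np.float64).max))   # about 709.8

with np.errstate(over="ignore"):
    overflowed = np.exp(np.float32(89.0))   # inf in single precision
    fine = np.exp(np.float64(89.0))         # finite in double precision

print(limit32, limit64, overflowed, fine)
```

For a strictly positive, finite gamma vector, psi(gamma_k) - psi(sum(gamma)) is never positive, so an overflow in this exp call hints that gammad already held invalid values by that point.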

notAmine · 2019-10-14T14:57:41Z

I'm encountering the same issue when using a large number of topics (200+).

piskvorky added the "bug" label (Issue described a bug) Jul 8, 2018

menshikh-iv added the "need info" label (Not enough information for reproduce an issue, need more info from author) Jul 31, 2018

menshikh-iv removed the "need info" label (Not enough information for reproduce an issue, need more info from author) Dec 13, 2018

menshikh-iv added the "difficulty easy" label (Easy issue: required small fix) Dec 13, 2018

menshikh-iv mentioned this issue Dec 13, 2018

Identical topics #416

Closed

horpto mentioned this issue Dec 24, 2018

get_document_topics returns and empty list #2306

Closed

horpto added a commit to horpto/gensim that referenced this issue Dec 24, 2018

Fix piskvorky#2115: Replace custom epsilons with automatic numpy equi…

cfbc0d5

…valent

horpto added a commit to horpto/gensim that referenced this issue Dec 24, 2018

Fix piskvorky#2115: Replace custom epsilons with automatic numpy equi…

a75062c

…valent

horpto mentioned this issue Jan 9, 2019

Replace custom epsilons with numpy equivalent in LdaModel #2308

Merged

menshikh-iv closed this as completed in #2308 Jan 9, 2019

menshikh-iv pushed a commit that referenced this issue Jan 9, 2019

Replace custom epsilons with numpy equivalent in LdaModel (#2308)

1b07f81

* Fix #2115: Replace custom epsilons with automatic numpy equivalent * fix typo

mpenkov mentioned this issue Nov 10, 2019

RuntimeWarning: overflow encountered in exp expElogthetad #2674

Open
