
[MRG] NMF notebook and logging fixups #2481

Merged: 9 commits into piskvorky:develop, Mar 14, 2021

Conversation

anotherbugmaster
Contributor

Rebased the branch.

@piskvorky
Owner

piskvorky commented Jun 3, 2019

@anotherbugmaster what is the status? Do you plan to finish the NMF implementation in Gensim?

@piskvorky
Owner

piskvorky commented Jun 12, 2019

@mpenkov we have to decide what to do about NMF.

It seems @anotherbugmaster doesn't have the capacity to finish it, it's been dragging on for more than a year. I fear we'll be stuck supporting yet another not-quite-finished algorithm.

I see two options: 1) we finish NMF ourselves, or 2) we remove NMF from Gensim.

@piskvorky piskvorky added the stale Waiting for author to complete contribution, no recent effort label Jun 12, 2019
@mpenkov
Collaborator

mpenkov commented Jun 21, 2019

I'm a bit torn. On one hand, there's still a long way to go before this is 100% done. I don't think we have the capacity to finish this ourselves. On the other, there's been a lot of effort on this, it'd be a waste to just discard it.

@anotherbugmaster What is your opinion? What sort of timeline do you have in mind for finishing this work?

@anotherbugmaster
Contributor Author

@piskvorky, @mpenkov, it's all done; the only remaining problems are the AppVeyor tests under Python 3.5 and 3.6, and the Travis tests, which refuse to run at all for some reason. I'd be glad if you could help me out there, because I don't have a Windows machine and have no idea what happened to Travis.

@mpenkov
Collaborator

mpenkov commented Jun 21, 2019

Try merging master in. That should fix at least some of the tests.

@anotherbugmaster
Contributor Author

Nope, it didn't work, same issue. :(

@mpenkov
Collaborator

mpenkov commented Jun 23, 2019

Looks like this is the cause of the problem:

doc        = [(0, 1), (1, 1), (2, 1)]
expected   = [0.02991635, 0.97008365]
self       = <gensim.test.test_nmf.TestNmf testMethod=testTransform>
transformed = [(0, 0.029590818817693605), (1, 0.97040918118230646)]
vec        = array([ 0.02959082,  0.97040915], dtype=float32)

The values are slightly off. Could it be a bug? Please investigate. If it's not a bug, we can relax the tolerances on those tests.

@anotherbugmaster
Contributor Author

anotherbugmaster commented Jun 24, 2019

@mpenkov, the thing is, these values are off only under Python 3.5 and 3.6 on Windows; every other platform returns the correct value. We could, of course, relax the constraint, but wouldn't that be too loose?

@mpenkov
Collaborator

mpenkov commented Jun 25, 2019

Relax the constraint under those conditions only (Windows, Py3.5 and 3.6), and add an informative comment linking to this discussion.
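A minimal sketch of what such a platform-conditional tolerance could look like; the helper name and the exact tolerance values here are my own hypothetical choices for illustration, not gensim code:

```python
# Hypothetical sketch of a platform-conditional tolerance for the NMF
# transform tests. The helper name and tolerance values are illustrative
# assumptions, not the actual gensim test code.
import sys

import numpy as np

def transform_rtol():
    """Looser rtol only on the platforms where the divergence appears
    (Windows under Python 3.5/3.6); strict everywhere else."""
    on_windows = sys.platform.startswith("win")
    old_python = sys.version_info[:2] in ((3, 5), (3, 6))
    return 1e-2 if (on_windows and old_python) else 1e-3

# Values from the failing test above:
expected = np.array([0.02991635, 0.97008365])
vec = np.array([0.02959082, 0.97040915])
close = np.allclose(expected, vec, rtol=transform_rtol(), atol=1e-3)
```

The informative comment linking back to this discussion would then live next to the tolerance, so future readers know why it differs per platform.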

@piskvorky
Owner

piskvorky commented Aug 21, 2019

@anotherbugmaster @mpenkov What's the status here? It's unlikely that math operations work differently under Windows, so this must be a bug, either in Gensim's NMF or, less likely, higher up the stack: Python, numpy, etc. Or is such non-determinism expected? If so, what is its source?

It shouldn't be hard to track down where the computed values start diverging, although I understand that doing it via CI (unless someone has a Windows machine they can use) is not very convenient.

@mpenkov
Collaborator

mpenkov commented Aug 27, 2019

@anotherbugmaster Ping on this. Are you able to diagnose the problem yourself?

@Maocx

Maocx commented Nov 4, 2019

Hey, I've spent some time investigating the divergence between Windows (10) and Linux (Ubuntu 18.04, running in a VirtualBox virtual machine). I used the following code to produce these outputs:

from gensim.models import nmf
from gensim.test.utils import common_corpus, common_dictionary

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)

model = nmf.Nmf(
    common_corpus,
    id2word=common_dictionary,
    chunksize=1,
    num_topics=2,
    passes=100,
    random_state=42,
)

print("_W:", model._W)

doc = list(common_corpus)[0]
transformed = model[doc]

Debug logging points us toward h_error. In this iterative function (it runs multiple iterations per update), the error diverges between Windows and Linux from the first iteration of the NMF algorithm (Windows output first, then Linux):

2019-11-04 17:43:40,212 : INFO : running NMF training, 2 topics, 100 passes over the supplied corpus of 9 documents, evaluating l2 norm every 9 documents
2019-11-04 17:43:40,212 : INFO : PROGRESS: pass 0, at document 1/9
2019-11-04 17:43:40,218 : DEBUG : h_error: None
2019-11-04 17:43:40,219 : DEBUG : h_error: 0.13008216155522226
2019-11-04 17:43:40,219 : DEBUG : h_error: 0.04467003211606769
2019-11-04 17:43:40,219 : DEBUG : h_error: 0.027791418113184013
2019-11-04 17:43:40,220 : DEBUG : h_error: 0.008731901498833499
2019-11-04 17:43:40,220 : DEBUG : h_error: 0.0027435125287528935
2019-11-04 17:43:40,220 : DEBUG : h_error: 0.0008619956370820014
2019-11-04 17:44:25,750 : INFO : running NMF training, 2 topics, 100 passes over the supplied corpus of 9 documents, evaluating l2 norm every 9 documents
2019-11-04 17:44:25,751 : INFO : PROGRESS: pass 0, at document 1/9
2019-11-04 17:44:25,760 : DEBUG : h_error: None
2019-11-04 17:44:25,762 : DEBUG : h_error: 0.13008216155522226
2019-11-04 17:44:25,763 : DEBUG : h_error: 0.04467003211606769
2019-11-04 17:44:25,763 : DEBUG : h_error: 0.027791418113184016
2019-11-04 17:44:25,764 : DEBUG : h_error: 0.0087319014988335
2019-11-04 17:44:25,764 : DEBUG : h_error: 0.00274351252875292
2019-11-04 17:44:25,765 : DEBUG : h_error: 0.000861995637081996

Investigating further, there appears to be a difference when initializing WtW in Nmf._solveproj (gensim/models/nmf.py): while the value of Wt is identical, WtW differs very slightly between Linux and Windows. After exporting the arrays with np.save() and importing them on Windows, I get for the first iteration:

Wt
Out[3]: 
array([[0.05069568, 0.06610443, 0.02389818, 0.16117773, 0.04791553,
        0.04729737, 0.02469517, 0.17604869, 0.10337164, 0.09267482,
        0.14958715, 0.00689207],
       [0.01411154, 0.15544358, 0.0238965 , 0.07832598, 0.0553748 ,
        0.04753334, 0.19527335, 0.05738823, 0.03207273, 0.14414264,
        0.0230432 , 0.14541275]])
Wt_lin
Out[4]: 
array([[0.05069568, 0.06610443, 0.02389818, 0.16117773, 0.04791553,
        0.04729737, 0.02469517, 0.17604869, 0.10337164, 0.09267482,
        0.14958715, 0.00689207],
       [0.01411154, 0.15544358, 0.0238965 , 0.07832598, 0.0553748 ,
        0.04753334, 0.19527335, 0.05738823, 0.03207273, 0.14414264,
        0.0230432 , 0.14541275]])
Wt - Wt_lin
Out[5]: 
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
WtW
Out[6]: 
array([[0.1113233 , 0.0651363 ],
       [0.0651363 , 0.12130034]])
WtW_lin
Out[7]: 
array([[0.1113233 , 0.0651363 ],
       [0.0651363 , 0.12130034]])
WtW - WtW_lin
Out[8]: 
array([[0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 1.38777878e-17]])
Wtv - Wtv_lin
Out[9]: 
array([[0.],
       [0.]])

My working hypothesis is that this tiny difference gets amplified over the iterations. A counter-argument might be that the difference is smaller than the machine precision for float64 (~1e-16), but the first divergence of the error is already at the ~1e-18 level (third iteration).

It is noted in numpy/numpy#9187 that NumPy does not aim to provide bit-identical results across platforms. My conclusion is that this non-determinism between platforms is to be expected, and @mpenkov's suggestion of relaxing the constraints under those conditions is sound. :)
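To illustrate why such tiny cross-platform differences are expected at all, here is a minimal demonstration (plain Python, not gensim code) that floating-point addition is not associative, so a different reduction order inside the BLAS on each platform can change the last bit of a dot product:

```python
# Floating-point addition is not associative: the same three terms summed
# in a different order give results that differ by ~1 ulp, the same scale
# as the 1.38777878e-17 entry in WtW - WtW_lin above.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6
assert left != right
diff = left - right  # on the order of 1e-16 (one ulp of 0.6)
```

A different summation order is exactly what different BLAS builds or SIMD widths produce, which is why bit-identical results across OSes are not guaranteed.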

@piskvorky
Owner

piskvorky commented Nov 4, 2019

Awesome, thanks for the detective work @Maocx !
What motivated your investigation? Do you need online NMF yourself?

The algo could use a "champion" in Gensim, who understands its implementation and is able to keep it sane going forward.

@mpenkov @anotherbugmaster was there anything else (besides the Windows divergence) that needed finishing up / polishing? Cheers.

@mpenkov
Collaborator

mpenkov commented Nov 5, 2019

To the best of my recollection, divergence was the only thing left.

@piskvorky piskvorky marked this pull request as ready for review November 5, 2019 08:14
@piskvorky
Owner

piskvorky commented Nov 5, 2019

Alright! Let's finish it up & release NMF officially then 🚀
Are its docs ready too?

@anotherbugmaster
Contributor Author

Hello everyone. The docs are ready, though I haven't revisited them since May. Anyway, I think everything should work OK; I'll relax the constraints and merge master.

@anotherbugmaster
Contributor Author

@piskvorky @mpenkov I guess we could merge it. What do you think?

@Maocx

Maocx commented Nov 22, 2019

@anotherbugmaster, I'm a bit disappointed by the difference between the W matrices generated with different RandomState values. I found that you introduced them around January 29; do you still remember your reasoning for including them? For instance, the weight of the second topic here changes by nearly 50% when the random_state changes:

import logging

import numpy as np
from gensim.models import nmf
from gensim.test.utils import common_corpus, common_dictionary

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


def perform_nmf(random_state=None):
    """
    Perform nmf/
    """
    model = nmf.Nmf(
        common_corpus,
        id2word=common_dictionary,
        chunksize=1,
        num_topics=2,
        passes=100,
        random_state=random_state,
        h_stop_condition=10e-6
    )
    return model


model1 = perform_nmf(random_state=42)

model2 = perform_nmf(random_state=43)
diff = model1._W - model2._W
t = np.sum(diff)

print(t)
print(model1._W)
print(model2._W)
>>>
0.4181815407169876
[[0.25698389 0.13212215]
 [0.         0.46704016]
 [0.02863034 0.35457582]
 [0.43792164 0.        ]
 [0.32878853 0.        ]
 [0.35545745 0.66227907]
 [0.43792164 0.        ]
 [0.51330699 0.06626947]
 [0.06355195 0.44236637]
 [0.02828574 0.        ]
 [0.07657501 0.        ]
 [0.06688829 0.        ]]
[[0.25543047 0.09710711]
 [0.         0.36720422]
 [0.01217318 0.27417546]
 [0.43444168 0.        ]
 [0.32464544 0.        ]
 [0.23349835 0.66157953]
 [0.43444168 0.        ]
 [0.46870586 0.1142939 ]
 [0.         0.41854257]
 [0.03644202 0.        ]
 [0.09030456 0.        ]
 [0.0777969  0.        ]]

I tried to follow the cited paper to check the implementation and had some difficulty identifying which steps you used; perhaps adding them to the docstring somewhere would be nice :) This is what I noted down from the exercise:

  • Modelling the outlier vectors is omitted, in contrast to the paper (perhaps you remember the reason? :) )
  • The projected gradient descent method from section IV.A is used to solve for h
  • (this is a guess) the PGD is optimized using stochastic gradient descent?
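For readers following along, the h subproblem from section IV.A (minimize ||v - Wh||² subject to h ≥ 0) can be sketched with plain projected gradient descent. This is a minimal illustration under my own assumptions (the helper `solve_h` and all parameter values are hypothetical), not gensim's actual implementation:

```python
# Minimal sketch of projected gradient descent for the h subproblem
# (minimize 0.5 * ||v - W @ h||^2 subject to h >= 0), in the spirit of
# section IV.A of the paper. Illustration only, not gensim's code;
# solve_h and all parameter values are assumptions.
import numpy as np

def solve_h(W, v, n_iter=200):
    """Solve for a non-negative h given W and a dense document vector v."""
    WtW = W.T @ W
    Wtv = W.T @ v
    # Step size 1/L, where L is the largest eigenvalue of WtW (the
    # Lipschitz constant of the gradient), guarantees convergence.
    eta = 1.0 / np.linalg.eigvalsh(WtW).max()
    h = np.zeros_like(Wtv)
    for _ in range(n_iter):
        grad = WtW @ h - Wtv                 # gradient of the objective
        h = np.maximum(h - eta * grad, 0.0)  # project onto h >= 0
    return h

rng = np.random.default_rng(42)
W = np.abs(rng.standard_normal((12, 2)))  # non-negative topic matrix
v = W @ np.array([0.3, 0.7])              # document built from known weights
h = solve_h(W, v)                         # recovers approximately [0.3, 0.7]
```

The projection step (the `np.maximum` with 0) is what keeps the factors non-negative between otherwise ordinary gradient updates.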

@anotherbugmaster
Contributor Author

@Maocx, concerning your questions:

  1. The random state controls the initial values of the W matrix, so it's expected that the values change to some degree; NMF methods are very sensitive to initialization.
  2. Unfortunately, modelling the outlier vectors caused a large performance drop, and I couldn't think of a way to solve that.
  3. I think so. What's your point?
  4. Well, in a sense, yes. We alternate between updating W and the next h, projecting the values of these matrices onto the non-negative orthant.
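The initialization sensitivity in point 1 is easy to reproduce with a toy NMF in plain numpy. The multiplicative-update scheme below is Lee and Seung's classic algorithm, used here only as an illustration of the general phenomenon, not as gensim's method:

```python
# Toy NMF with Lee & Seung multiplicative updates, in plain numpy, showing
# that different random seeds converge to visibly different factorizations
# with comparable reconstruction error. An illustration of initialization
# sensitivity in general, not of gensim's algorithm.
import numpy as np

def toy_nmf(X, k, seed, n_iter=300, eps=1e-9):
    """Factorize non-negative X (n x m) into W (n x k) and H (k x m)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps
    H = rng.random((k, X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)  # multiplicative H update
        W *= (X @ H.T) / (W @ H @ H.T + eps)  # multiplicative W update
    return W, H

X = np.abs(np.random.default_rng(0).standard_normal((20, 12)))
W1, H1 = toy_nmf(X, k=2, seed=42)
W2, H2 = toy_nmf(X, k=2, seed=43)
err1 = np.linalg.norm(X - W1 @ H1)
err2 = np.linalg.norm(X - W2 @ H2)
# W1 and W2 differ noticeably, yet err1 and err2 are comparable:
# different seeds land in different, equally good local optima.
```

This is the same behavior seen in the gensim output above: different random_state values give different W matrices because the objective is non-convex and the solver keeps whichever local optimum the initialization leads to.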

@Maocx

Maocx commented Nov 22, 2019

Thanks for your answer! Sorry for mixing my questions up a bit:

@anotherbugmaster
Contributor Author

anotherbugmaster commented Nov 22, 2019

@Maocx
3. Yeah, I totally agree. :)
4. I took the idea from the sklearn.decomposition.NMF implementation. I can't remember the exact reason for doing it this way (AFAIK the algorithm in that section is coordinate descent, which is why we shuffle the coordinates to randomize the direction of each step), but I suppose it led to a performance increase.

@mpenkov
Collaborator

mpenkov commented Jun 10, 2020

@anotherbugmaster @Maocx Ping on this PR. How's it going? Looks like there is a merge conflict - can you please resolve it? Is there anything else left to do before we merge?

@anotherbugmaster
Contributor Author

@mpenkov it seems the merge conflict was caused by rtol becoming even stricter: it's now 1e-3 instead of 1e-2. I don't know why rtol was changed in develop and, to be honest, I'm no longer interested enough in NMF to find out.

@piskvorky
Owner

piskvorky commented Jun 11, 2020

Actually, rtol was relaxed by @menshikh-iv, from 1e-4 to 1e-3, here:
a2ec4c3#diff-ab0724b3cf3845e81150fb3a18ff045eL101

And all tests in that PR passed, so I'm not sure why 1e-2 would be needed.

@anotherbugmaster
Contributor Author

The Windows tests wouldn't pass with some Python versions. You can accept the current version with 1e-3 if that's no longer the case.

@mpenkov mpenkov added this to the 4.0.0 milestone Feb 25, 2021
@piskvorky piskvorky self-assigned this Feb 25, 2021
@piskvorky piskvorky force-pushed the nmf_add_lsi branch 2 times, most recently from 703721d to 47e60f3, on March 7, 2021 16:50
@piskvorky piskvorky changed the title NMF notebook and logging fixups [MRG] NMF notebook and logging fixups Mar 7, 2021
@piskvorky piskvorky requested a review from mpenkov March 7, 2021 20:23
@piskvorky piskvorky removed the stale Waiting for author to complete contribution, no recent effort label Mar 9, 2021
@piskvorky piskvorky merged commit 700d6b1 into piskvorky:develop Mar 14, 2021
4 participants