[MRG] NMF notebook and logging fixups #2481
Conversation
2ae8579 to 39cdd77
39cdd77 to 5155e4e
@anotherbugmaster what is the status? Do you plan to finish the NMF implementation in Gensim? |
@mpenkov we have to decide what to do about NMF. It seems @anotherbugmaster doesn't have the capacity to finish it, it's been dragging on for more than a year. I fear we'll be stuck supporting yet another not-quite-finished algorithm. I see two options: 1) we finish NMF ourselves, or 2) we remove NMF from Gensim. |
I'm a bit torn. On one hand, there's still a long way to go before this is 100% done. I don't think we have the capacity to finish this ourselves. On the other, there's been a lot of effort on this, it'd be a waste to just discard it. @anotherbugmaster What is your opinion? What sort of timeline do you have in mind for finishing this work? |
@piskvorky, @mpenkov, it's all done; the only problems are the Appveyor tests under 3.5 and 3.6, and the Travis tests, which refuse to run at all for some reason. I would be glad if you'd help me out there, because I don't have a Windows machine and have no idea what happened to Travis. |
Try merging master in. That should fix at least some of the tests. |
Nope, it didn't work, same issue. :( |
Looks like this is the cause of the problem:
The values are slightly off. Could it be a bug? Please investigate. If it's not a bug, we can relax the tolerances on those tests. |
@mpenkov, the thing is, these values are off only on the 3.5 and 3.6 on Windows, every other platform returns the correct value. We could, of course, relax the constraint, but wouldn't it be too loose? |
Relax the constraint under those conditions only (Windows, Py3.5 and 3.6), and add an informative comment linking to this discussion. |
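One way to scope the relaxation to just those platforms (a sketch with hypothetical helper names; the actual tests live in Gensim's test suite) might be:

```python
import sys

import numpy as np

def nmf_rtol():
    """Hypothetical helper: loosen the tolerance only where the NMF
    output is known to drift (Windows under Python 3.5/3.6); see the
    discussion in this PR."""
    on_windows = sys.platform == "win32"
    on_py35_36 = sys.version_info[:2] in ((3, 5), (3, 6))
    return 1e-2 if (on_windows and on_py35_36) else 1e-3

def check_topics(actual, expected):
    # Compare computed topic weights against the expected values using
    # the platform-dependent relative tolerance.
    np.testing.assert_allclose(actual, expected, rtol=nmf_rtol())
```

This keeps the stricter tolerance everywhere the results are reproducible, so a genuine regression on Linux or macOS would still fail the test.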
@anotherbugmaster @mpenkov What's the status here? It's unlikely that math operations work differently under Windows, so this must be a bug, either in Gensim's NMF or, less likely, higher up the stack: Python, numpy, etc. Or is such non-determinism expected? What is its source? It shouldn't be hard to track down where the computed values start diverging, although I understand that doing it via CI (unless someone has a Windows machine they can use) is not very convenient. |
@anotherbugmaster Ping on this. Are you able to diagnose the problem yourself? |
Hey, I've spent some time investigating the issue of divergence between Windows (10) and Linux (Ubuntu 18.04, in a VirtualBox virtual machine). I used the following code to produce these outputs:

```python
import logging

from gensim.models import nmf
from gensim.test.utils import common_corpus, common_dictionary

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)

model = nmf.Nmf(
    common_corpus,
    id2word=common_dictionary,
    chunksize=1,
    num_topics=2,
    passes=100,
    random_state=42,
)
print("_W:", model._W)

doc = list(common_corpus)[0]
transformed = model[doc]
```

Debug logging points us in the direction of the h_error. In this iterative function (it does multiple iterations for each update), the error diverges between Windows and Linux from the first iteration of the NMF algorithm (first Windows, then Linux):
Investigating this further, it appears that there is an issue initializing WtW in `Nmf._solveproj` (gensim/models/nmf.py): while the value of Wt is the same, there is a very small difference between Linux and Windows here. After exporting with np.save() and importing on Windows, I get for the first iteration:
It is my working hypothesis that this difference is enlarged by the iterations. A counter-argument might be that the difference is smaller than the machine precision for float64 (~2.2e-16), but the first divergence of the error is at e-18 (3rd iteration). It is noted in numpy/numpy#9187 that NumPy does not aim to provide exactly the same results on different platforms. In conclusion, I think this non-determinism between platforms is to be expected, and @mpenkov's solution of relaxing the constraints under those conditions should be sound :) |
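For anyone following along, the non-associativity at the root of this is easy to reproduce in isolation. This snippet is a plain-Python illustration of the mechanism, not Gensim code:

```python
# Floating-point addition is not associative, so the summation order a
# BLAS/SIMD kernel happens to use changes the last bits of a dot product
# such as WtW -- even for bit-identical inputs.
a = (0.1 + 0.2) + 0.3  # 0.6000000000000001
b = 0.1 + (0.2 + 0.3)  # 0.6
print(a == b)          # False
print(abs(a - b))      # ~1.1e-16
```

Once a last-bit difference like this enters WtW, every subsequent h_error iteration is computed from slightly different inputs, which would explain why the logged errors drift apart from the very first iteration.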
Awesome, thanks for the detective work @Maocx ! The algo could use a "champion" in Gensim, who understands its implementation and is able to keep it sane going forward. @mpenkov @anotherbugmaster was there anything else (besides the Windows divergence) that needed finishing up / polishing? Cheers. |
To the best of my recollection, divergence was the only thing left. |
Alright! Let's finish it up & release NMF officially then 🚀 |
Hello everyone. Docs are ready, though I haven't revisited them since May. Anyway, I think everything should work OK, I'll relax the constraints and merge master. |
@piskvorky @mpenkov I guess we could merge it, what do you think? |
@anotherbugmaster, I'm a bit disappointed in the difference between the W matrices generated with different RandomState values. I found that you introduced them around January 29; do you still remember your reasoning for including them? For instance, the weight of the second topic here changes by nearly 50% when the random state changes:

```python
import logging

import numpy as np

from gensim.models import nmf
from gensim.test.utils import common_corpus, common_dictionary

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

def perform_nmf(random_state=None):
    """Train an NMF model on the common test corpus."""
    model = nmf.Nmf(
        common_corpus,
        id2word=common_dictionary,
        chunksize=1,
        num_topics=2,
        passes=100,
        random_state=random_state,
        h_stop_condition=10e-6,
    )
    return model

model1 = perform_nmf(random_state=42)
model2 = perform_nmf(random_state=43)
diff = model1._W - model2._W
t = np.sum(diff)
print(t)
print(model1._W)
print(model2._W)
```

```
0.4181815407169876
[[0.25698389 0.13212215]
 [0.         0.46704016]
 [0.02863034 0.35457582]
 [0.43792164 0.        ]
 [0.32878853 0.        ]
 [0.35545745 0.66227907]
 [0.43792164 0.        ]
 [0.51330699 0.06626947]
 [0.06355195 0.44236637]
 [0.02828574 0.        ]
 [0.07657501 0.        ]
 [0.06688829 0.        ]]
[[0.25543047 0.09710711]
 [0.         0.36720422]
 [0.01217318 0.27417546]
 [0.43444168 0.        ]
 [0.32464544 0.        ]
 [0.23349835 0.66157953]
 [0.43444168 0.        ]
 [0.46870586 0.1142939 ]
 [0.         0.41854257]
 [0.03644202 0.        ]
 [0.09030456 0.        ]
 [0.0777969  0.        ]]
```

I tried to follow the cited paper to check the implementation and had some difficulty identifying which steps you used; perhaps adding them somewhere in the docstring would be nice :) This is what I noted down from the exercise:
|
@Maocx, concerning your questions:
|
Thanks for your answer! Sorry for mixing my questions up a bit:
|
@Maocx |
@anotherbugmaster @Maocx Ping on this PR. How's it going? Looks like there is a merge conflict - can you please resolve it? Is there anything else left to do before we merge? |
@mpenkov seems like the merge conflict was caused by rtol getting even stricter: now it's 1e-3 instead of 1e-2. I don't know why rtol was changed in |
Actually And all tests in that PR passed, so not sure why 1e-2 would be needed? |
Windows tests wouldn't work with some versions of Python. You can accept the current version with 1e-3 if that's no longer the case.
703721d to 47e60f3
Rebased.