different Phrases behavior for same code/corpus between 0.13.4.1 and 2.1.0 #1401
Comments
Certainly a concern and maybe a worrisome regression. Without yet looking deeply into specifics of your process, some things worth checking: …
Hi @gojomo; thanks.
This one is easy to check. I will check on the rest of the stuff.
Yes, it turns out.
Looks like quite the hint. I train the model like this (which presumably calls …)
I'm trying right now to train it without the ngram stuff, just the plain sentences iterator.
You are right about the ngrams -- removing that and throwing in the plain sentences iterator on 2.1.0 works fine. Any idea what might have caused the change?
This is helpful narrowing. It might be a Phrases regression; we'll have to take a look at recent changes there, especially around 34759be. @tmylk @ELind77 A practical workaround could be to apply the (two-level) phrase-ification just once, writing the results to a new file, which is then iterated over multiple times as a simple text corpus. (This could offer performance benefits too: the repeated phrase-promotion on each training iteration can be expensive duplicate work.)
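For reference, a minimal sketch of that workaround, assuming a whitespace-tokenized corpus on disk; the file names and toy setup are illustrative, not from the original script:

```python
from gensim.models.phrases import Phrases, Phraser
from gensim.models.word2vec import Word2Vec, LineSentence

# Restartable stream: one whitespace-tokenized sentence per line.
raw_sentences = LineSentence('corpus.txt')

bigram = Phraser(Phrases(raw_sentences))
trigram = Phraser(Phrases(bigram[raw_sentences]))

# Apply the two-level phrase promotion exactly once, writing results to disk.
with open('corpus_phrased.txt', 'w') as out:
    for sentence in raw_sentences:
        out.write(' '.join(trigram[bigram[sentence]]) + '\n')

# Every Word2Vec pass now streams the pre-phrased file directly;
# no phrase-promotion work is repeated per training iteration.
model = Word2Vec(LineSentence('corpus_phrased.txt'))
```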
Thanks a lot @jeremybmerrill! And @gojomo. Whatever the root cause is, it implies our tests are insufficient if they allow something like this to slip through.
Thanks @gojomo for that tip. I load the bigrams/trigrams models, then iterate through my raw corpus, saving the resulting phrase-ified tokens to a new clean file (space-delimited tokens, basically). Then I train w2v on that new clean file. Now it works as expected (i.e. producing "queen" from "man : king :: woman :_____"). Oddly though, it took about 3 hours to train everything, compared to about an hour before. Not a huge problem, but still odd. Is there a quicker way that you were envisioning to write the output of the phraseification to disk than this?
While the initial writing of the tokenized/phrase-combined corpus is an extra time-consuming step, I would expect each individual followup Word2Vec pass to be a little faster, because the computationally expensive phrase-upgrading isn't being repeated each time in a single bottleneck thread. Can you track the timing of each individual step, by either watching logging or adding time/duration printouts, to see for sure if anything has slowed? (And, to be sure that the 'faster' path isn't in some way just inadvertently doing less, such as skipping training entirely because a non-restartable iterator only allows the vocab-discovery pass, not multiple training passes?) That is roughly the way I'd write a tokenized corpus to disk. I would however tend to use the …
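A tiny sketch of the kind of duration printouts suggested here (stdlib only; the wrapped pipeline functions in the usage comments are hypothetical stand-ins):

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and print how long it took."""
    start = time.time()
    result = fn(*args, **kwargs)
    print('%s took %.1f s' % (label, time.time() - start))
    return result

# Hypothetical usage around each pipeline phase:
#   timed('phrase-ify corpus', write_phrased_corpus, 'corpus.txt', 'phrased.txt')
#   model = timed('word2vec training', Word2Vec, LineSentence('phrased.txt'))
```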
Yes, it works now. Just took 16 minutes to train, with the pre-written phraseified corpus. And the results are as expected. Thanks for the suggestion! I'll try re-doing the writing step with timers.
So the remaining issue here is that something in the Phrases behavior changed between 0.13.4.1 and 2.1.0, in a way that caused your otherwise working (and proper-seeming) code to stop working. Ideally we'll find a tiny test case to exhibit the difference. Leaving this issue open, with an updated title, for finding/fixing that issue.
CC @prakhar2b @menshikh-iv @jayantj -- let's make sure we don't transfer this newly introduced bug into the "optimized" Phrases too.
@piskvorky okay, we'll make sure.
@gojomo Just to be clear, did I break something? It seems like I didn't, but I want to make sure so that I can fix it if I need to.
@ELind77 I'm not sure, but the evidence above makes it seem like something changed in …
Another report on list: https://groups.google.com/d/msg/gensim/iT-7yEbQWuQ/F395xjhJDAAJ
@menshikh-iv As per your suggestion, I would like to take this up. Can you provide some starting points? Thanks a lot.
@menshikh-iv I read the bug reports and tried to reproduce the error by comparing both versions, but I am getting unexpected results for both versions. Can you please suggest what corpus I should use for training the word2vec model?
@sj29-innovate this is a bug in Phrases (w2v is used only for demonstration; you can ignore it). Also, add …
@menshikh-iv So basically, first I should find some test case such that the phrasing output is different for different versions, right? Thanks.
@sj29-innovate the problem is not with a "concrete" dataset; the main "hint" for you is here: #1401 (comment) ("train called with empty iterator"), where Phrases is called as … You can try to reproduce this behavior with w2v and any dataset (as in (1) from my list).
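A hedged sketch of that reproduction idea: build Phrases over a toy corpus, feed the transformed stream straight into Word2Vec, and watch the training logs. The corpus and parameters here are illustrative stand-ins:

```python
import logging
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

# Toy stand-in corpus; any tokenized dataset works.
sentences = [['new', 'york', 'is', 'big'],
             ['new', 'york', 'never', 'sleeps']] * 100

phrases = Phrases(sentences, min_count=1, threshold=1)

# If phrases[sentences] behaves as a one-shot iterator, Word2Vec's
# vocab-discovery pass consumes it, and the actual training passes then
# receive an empty iterator: the logs report training on ~0 raw words.
model = Word2Vec(phrases[sentences], min_count=1)
```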
@menshikh-iv @gojomo I think I have found the glitch: the `_is_single` function caused this behaviour. Returning `obj` instead of an iterator over it (`iter(obj)`) gives correct results.

```python
obj_iter = iter(obj)
try:
    peek = next(obj_iter)
    obj_iter = it.chain([peek], obj_iter)
except StopIteration:
    # An empty object is a single document
    return True, obj
if isinstance(peek, string_types):
    # It's a document, return the iterator
    return True, obj_iter
else:
    # If the first item isn't a string, assume obj is a corpus
    return False, obj_iter
```

Changing it to

```python
obj_iter = iter(obj)
try:
    peek = next(obj_iter)
    obj_iter = it.chain([peek], obj_iter)
except StopIteration:
    # An empty object is a single document
    return True, obj
if isinstance(peek, string_types):
    # It's a document, return the iterator
    return True, obj_iter
else:
    # If the first item isn't a string, assume obj is a corpus
    return False, list(obj_iter)
```

gives correct results, because the `_is_single` check is followed by `self._apply` later on; returning an iterator when the object is a corpus therefore ends up yielding an empty iterator.
@sj29-innovate The lack of line formatting makes the code hard to compare/view; can you fix it in your comment? Is there a minimal synthetic corpus that illustrates the issue, and shows that the behavior after the proposed change is preferable?
@menshikh-iv @gojomo Should I submit a PR with appropriate tests for the same?
@sj29-innovate of course!
I don't quite understand how the proposed test applies, but: does this exact test succeed in …?
Yes, it failed on the current version, whereas it worked after the fix was applied. What I deduced was that the `_is_single` function returned an iterator over the object when the object is a corpus, but this iterator then went through `self._apply()`, which again made an iterator out of it; that is why such a fix worked.
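In other words, the classic single-pass iterator pitfall, illustrated with plain Python:

```python
corpus = iter([['a', 'b'], ['c', 'd']])

print(list(corpus))  # [['a', 'b'], ['c', 'd']] -- the first pass consumes it
print(list(corpus))  # [] -- any second pass sees an empty iterator
```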
Very important comment from Gordon: #1853 (comment)
* bm25 scoring function updated
* Fixes #1401, Phrases behaviour now consistent on different versions; test added for empty iterator
* Fixes #1401, IS_SINGLE function updated
* Fixes #1401, IS_SINGLE function updated
* Fixes #1401, IS_SINGLE function updated
* Fixes #1401, tests for phrasified sentences added
Sorry to be a dunce, but this conversation means this issue is now fixed, right? I have version 3.2.0 and still see this odd behavior with Phrases. But perhaps I need to ask our systems admin to update: either re-get version 3.2.0, or get a later version? In the meantime, I will work to implement the temporary fix discussed here.
@NikkiAdams the fix for this issue was released in …
Description
Hi friends -- I've been using gensim 0.13.4.1 to generate bigrams and trigrams (Phrases model), then to train a word2vec model on a stream of sentences run through the trigrams phrase model. Works great.
But I've tried to upgrade to 1.0.0 and 2.1.0 and the results are way, way worse -- with the exact same code. I've scoured the release notes but haven't found an explanation. I don't see deprecation warnings in 0.13.4.1 either.
I should clarify -- on gensim 2.1.0, everything "works", i.e. there are no exceptions. It's just that the answers it gives for similarity are incorrect. The classic man : king :: woman : __ analogy gives "chernow" in the 2.1.0-trained model, but "queen" (the expected answer) in the 0.13.4.1-trained model.
I'm not an expert on any of this, so this could be a silly mistake. (I posted about this in the gitter channel and was directed to open an issue.)
Steps/Code/Corpus to Reproduce
Snippet:
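A hypothetical reconstruction of the pipeline this report describes (two Phrases levels feeding Word2Vec), with a toy corpus and illustrative parameters standing in for the lost snippet; this is not the reporter's exact code:

```python
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases

# Toy stand-in corpus; the real script streams sentences from disk.
sentences = [['the', 'queen', 'of', 'england'],
             ['the', 'king', 'of', 'spain']] * 50

bigrams = Phrases(sentences, min_count=1, threshold=1)
trigrams = Phrases(bigrams[sentences], min_count=1, threshold=1)

# The trigram-phrased stream is what gets fed to Word2Vec; on 2.1.0 this
# transformed corpus behaved like a one-shot iterator, which is the bug
# discussed above.
model = Word2Vec(trigrams[bigrams[sentences]], min_count=1)
```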
FULL training script (slightly sanitized): https://github.com/jeremybmerrill/ineffable_abracadabra_of_cleaning/blob/master/train_in_vain.py. The only change needed to get the different results is editing `requirements.txt` to specify gensim 2.1.0 and re-running `pip install -r requirements.txt`.
Expected Results
`print(model.most_similar_cosmul(positive=['woman', 'king'], negative=['man']))` should have `queen` as the first answer, e.g.:

```
[(u'queen', 0.8925966620445251), (u'royal', 0.8144874572753906), (u'wales', 0.7882706522941589), (u'kingston_ontario', 0.7752816677093506), (u'helen_mirren', 0.7718634009361267), (u'bismarck', 0.7668762803077698), (u'princess', 0.7609482407569885), (u'queen_elizabeth_ii', 0.76094651222229), (u'prince_philip', 0.7592762112617493), (u'duchess', 0.7587381601333618)]
```

This is the result from 0.13.4.1.
Actual Results
It gives `chernow` as the first answer, plus a bunch of other irrelevant stuff:

```
[("chernow", 0.8686840534210205), ("anxiety_disorders", 0.8328080177307129), ("marlinga", 0.8300415277481079), ...]
```

(This is with a model trained in 2.1.0, calling `model.most_similar_cosmul` in 2.1.0. I haven't tested what happens with a model trained in 0.13.4.1 but loaded and queried via `model.most_similar_cosmul` in 2.1.0.)
Versions
I appreciate any help y'all can offer! (Great software, love it, so much fun, :D etc. )