Fix empty output bug in `Phrases`. Fix #1401 #1853

sj29-innovate · 2018-01-23T22:35:00Z

The "is_single" function previously returned iterator in case the object was a corpus and was followed by self._apply() which again made an iterator , it works fine now with this fix.

(fixes #1401)

…into develop

…into develop Conflicts: gensim/summarization/bm25.py

…versions , test added for empty iterator

menshikh-iv

I also want to see a concrete example that doesn't work correctly earlier, but now works fine (because we can't reproduce it properly in unittest).

menshikh-iv · 2018-01-24T07:44:19Z

gensim/models/phrases.py

@@ -115,7 +115,7 @@ def _is_single(obj):
        return True, obj_iter
    else:
        # If the first item isn't a string, assume obj is a corpus
-        return False, obj_iter
+        return False, list(obj_iter)


obj_iter can be very large -> we can receive out-of-memory here.

It also works if I return just obj instead of list(obj_iter) but it fails if the obj is an iterator such as iter([ [] , [] ]) , so I am thinking of adding a special case to handle such iterators because some of the unit tests were failing and else otherwise just return the obj.
Proposed change ,

obj_iter = iter(obj) temp_iter = obj_iter try: peek = next(obj_iter) obj_iter = it.chain([peek], obj_iter) except StopIteration: # An empty object is a single document return True, obj if isinstance(peek, string_types) : # It's a document, return the iterator return True, obj_iter if temp_iter == obj : #Checking for an iterator to the object return False , list(obj_iter) else : # If the first item isn't a string, assume obj is a corpus return False, obj_iter

@menshikh-iv What do you think about the above proposed changes ?

@sj29-innovate anyway, list(...) unacceptable here.

@menshikh-iv Yes , I agree that is not a very good way to do that . This works for me without using list(...),

obj_iter = iter(obj) temp_iter = obj_iter try: peek = next(obj_iter) obj_iter = it.chain([peek], obj_iter) except StopIteration: # An empty object is a single document return True, obj if isinstance(peek, string_types): # It's a document, return the iterator return True, obj_iter if temp_iter == obj : #Checking for iterator to the object return False , obj_iter else: # If the first item isn't a string, assume obj is a corpus return False, obj

I realised there was no use of using the list(...), works well with the above change.

sj29-innovate · 2018-01-24T11:11:39Z

Here are some of the results i am getting with my proposed changes . I am using eng_news_2015_30K.tar.gz corpus and the script provided here https://github.com/jeremybmerrill/ineffable_abracadabra_of_cleaning/blob/master/train_in_vain.py for my testing.

Comparing the output of " print(model.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])) " before and after the changes on latest version ->

Without Changes ->

[(u'berkley', 0.8121861219406128), (u'two_thirds', 0.8113135099411011), (u'quizzed', 0.8031861782073975), (u'kade', 0.7986487150192261), (u'ssi', 0.7958680391311646), (u'reversing', 0.7934471368789673), (u'home_run', 0.7916820645332336), (u'navajo', 0.7898678779602051), (u'implementation_plan', 0.7837337255477905), (u'send', 0.7834026217460632)]
train() was called with empty iterator

With the changes ->

[(u'first_lady', 0.93344646692276), (u'mrs', 0.9223362803459167), (u'ms', 0.9138304591178894), (u'jane', 0.9107893109321594), (u'kate', 0.9079267978668213), (u'catherine', 0.9071255922317505), (u'anne', 0.9063032269477844), (u'queen', 0.9023390412330627), (u'maria', 0.9020583033561707), (u'elizabeth', 0.8988279104232788)]

menshikh-iv · 2018-01-24T11:39:04Z

@sj29-innovate can you make more "focused" example (without word2vec), only with phrases?

sj29-innovate · 2018-01-24T18:47:28Z

@menshikh-iv
Please have a look at this observation , "is_single" function returns an object of type "itertools.chain" as per the current version which is a "generator" and therefore gets exhausted after one iteration and gets emptied for the further computation. Therefore any use case , which involves more than one iteration of such an object will fail , that is why word2vec training also failed because it involves multiple iterations of it.

I encountered this when i ran two loops over the same object returned by "is_single" one after the another and the second loop didnt work.

My test case was as follows ,

test_sents = [
  [ 'I','did','not'],
  ['I','can','not']
]

corpus = trigrams_model[bigrams_model[test_sents]]
for sent in corpus :
  print sent
for sent in corpus :
  print sent

Output ->

[u'I', u'did_not']
[u'I', u'can_not']

Expected Output->

[u'I', u'did_not']
[u'I', u'can_not']
[u'I', u'did_not']
[u'I', u'can_not']

menshikh-iv · 2018-01-25T04:12:18Z

@sj29-innovate yes, almost what I wanted to see, please make it executable (all imports, fits Phrases, dataset), I want to copy-paste it and run with different gensim version.

sj29-innovate · 2018-01-25T04:42:52Z

@menshikh-iv Here is the executable code for the above testcase. Added brown corpus for training for a better phrases behaviour.

from gensim.models.phrases import Phrases,Phraser
from nltk.corpus import brown

test_sents = [
  [ 'I','did','not'],
  ['I','can','not']
]

bigrams_phrases = Phrases(brown.sents())
bigrams_model = Phraser(bigrams_phrases)
trigrams_phrases = Phrases(bigrams_model[brown.sents()])
trigrams_model = Phraser(trigrams_phrases)
model = trigrams_model[bigrams_model[test_sents]]

print "Printing in First Loop\n"
for sent in model :
  print sent

print "\nPrinting in Second Loop\n"
for sent in model :
  print sent

sj29-innovate · 2018-01-29T18:16:26Z

@menshikh-iv I believe we must look at generator type check in word2vec which failed to recognize "itertools.chain" and therefore ended up training on empty corpus. If the changes to "is_single" function seems fine to you , I can commit them.

menshikh-iv · 2018-01-30T09:35:51Z

@sj29-innovate can you look into this function https://github.com/RaRe-Technologies/gensim/blob/255ce25903c2521ad6c92c00875a7dc3aa88fe7d/gensim/utils.py#L797 looks very similar to the current case.

sj29-innovate · 2018-01-30T10:09:14Z

@menshikh-iv The problem I figured out was that it was not possible to do multiple iterations of the object we got from the Phraser output as it was a "generator" object , Please look at the following snippet

PhraseObject = trigrams_model[bigrams_model[sentences]]
print type(PhraseObject)
print type(PhraseObject.corpus)

Output

<class 'gensim.interfaces.TransformedCorpus'>
<type 'itertools.chain'>

__iter__ function for class TransformedCorpus ->

def __iter__(self):
        if self.chunksize:
            for chunk in utils.grouper(self.corpus, self.chunksize):
                for transformed in self.obj.__getitem__(chunk, chunksize=None):
                    yield transformed
        else:
            for doc in self.corpus:
                yield self.obj[doc]

So basically the self.corpus is an itertools.chain object which is getting emptied when we iterate through it.

menshikh-iv · 2018-01-31T05:11:35Z

@sj29-innovate I understand what you mean, can you push your last changes?

menshikh-iv

Looks like almost done, please make last changes.

menshikh-iv · 2018-01-31T06:05:09Z

gensim/test/test_phrases.py

+        bigram_phraser = Phraser(bigram_phrases)
+        trigram_phrases = Phrases(bigram_phraser[self.sentences])
+        trigram_phraser = Phraser(trigram_phrases)
+        self.assertNotEqual(trigram_phraser[bigram_phraser[self.sentences]].__len__(), 0)


Maybe it's better to make it more similar with example in PR

trigrams = trigram_phraser[bigram_phraser[self.sentences]] fst, snd = list(trigrams), list(trigrams) self.assertEqual(fst, snd) self.assertNotEqual(snd, [])

menshikh-iv · 2018-01-31T08:30:53Z

@sj29-innovate good work: +1:
@gojomo can you look into (to be sure that I didn't miss anything)?

sj29-innovate · 2018-01-31T08:40:15Z

@menshikh-iv Thanks , can you suggest some further issues once it's reviewed by Gordon.

menshikh-iv · 2018-01-31T08:42:53Z

@sj29-innovate of course!

sj29-innovate · 2018-02-02T07:14:59Z

@menshikh-iv Can you already suggest me some other issue I can start working on. Thanks.

menshikh-iv · 2018-02-02T10:03:09Z

@sj29-innovate try to fix 1869 (fresh-investigated bug in MmCorpus)

menshikh-iv · 2018-02-05T19:22:02Z

@gojomo ping

gojomo · 2018-02-05T20:18:34Z

I think the preexisting API - using []-indexing to do something different to a single-case or an iterable, and the _is_single() peek-testing – is inherently confusing & error-prone and should be eliminated entirely.

With regard to a minimal fix for the specific issue in #1401, if the test case works in a suitably-old gensim (0.13.4.1), then breaks in a newer version (1.0.0 through 2.1.0) – matching the #1401 report – then works again after the fix, that's a good sign.

It'd be even better if the exact commit that broke the behavior was identified, and the exact person who made that commit (or understood the cases _is_single() was supposed to handle) reviewed this to see if it was the original intent. I think it's 34759be by @ELind77 & committed by @tmylk.

I'm suspicious this may fail too-subtly in the common mistake-case where someone provides an iterator object, and not a restartable iterable object - it'll fail to restore the peeked object, and it will only be usable for a single iteration (possibly resulting in other subtle issues down-the-line, if other training expects re-iteration).

menshikh-iv · 2018-02-09T09:47:42Z

@gojomo so, you propose to completely remove is_single here ?

gojomo · 2018-02-10T02:30:40Z

I’d discard the practice of using []-indexing as the interface to stream/generator-phrasification – a compatibility-breaking change. Then there’d be no reason for confusing _is_single()-like behavior-switching.

menshikh-iv · 2018-02-12T06:22:54Z

@gojomo, unfortunately, this is almost impossible now (too breaking change now, also we broke very stable API).

sj29-innovate · 2018-02-14T19:51:19Z

@menshikh-iv What are the further requirements here ?

menshikh-iv · 2018-02-15T11:37:45Z

@sj29-innovate for the current state - that's all that we can do here (without any breaking changes), thank you @sj29-innovate!

* bm25 scoring function updated * Fixes piskvorky#1401 , Phrases behavious now consistent on different versions , test added for empty iterator * Fixes piskvorky#1401 , IS_SINGLE Function updated * Fixes piskvorky#1401 , IS_SINGLE function updated * Fixes piskvorky#1401 , IS_SINGLE function updated * Fixes piskvorky#1401 , tests for phrasified sentences added

sj29-innovate added 4 commits January 8, 2018 13:26

bm25 scoring function updated

f9803cb

Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim …

ef8e918

…into develop

Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim …

33a88f4

…into develop Conflicts: gensim/summarization/bm25.py

Fixes piskvorky#1401 , Phrases behavious now consistent on different …

d9b098a

…versions , test added for empty iterator

sj29-innovate changed the title ~~Fixes #1401 , Phrases behaviousr now consistent on different versions , tests added for empty iterator~~ Fixes #1401 , Phrases behaviour now consistent on different versions , tests added for empty iterator Jan 23, 2018

menshikh-iv suggested changes Jan 24, 2018

View reviewed changes

sj29-innovate added 3 commits January 31, 2018 11:04

Fixes piskvorky#1401 , IS_SINGLE Function updated

c25bb4b

Fixes piskvorky#1401 , IS_SINGLE function updated

f23ef6a

Fixes piskvorky#1401 , IS_SINGLE function updated

dfdd16e

menshikh-iv suggested changes Jan 31, 2018

View reviewed changes

Fixes piskvorky#1401 , tests for phrasified sentences added

3d26f32

menshikh-iv changed the title ~~Fixes #1401 , Phrases behaviour now consistent on different versions , tests added for empty iterator~~ Fix bug in Phrases. Fix #1401 Feb 1, 2018

menshikh-iv changed the title ~~Fix bug in Phrases. Fix #1401~~ Fix empty output bug in Phrases. Fix #1401 Feb 15, 2018

menshikh-iv mentioned this pull request Feb 15, 2018

different Phrases behavior for same code/corpus between 0.13.4.1 and 2.1.0 #1401

Closed

menshikh-iv merged commit 2722744 into piskvorky:develop Feb 15, 2018

sj29-innovate deleted the new-develop branch February 15, 2018 12:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix empty output bug in `Phrases`. Fix #1401 #1853

Fix empty output bug in `Phrases`. Fix #1401 #1853

sj29-innovate commented Jan 23, 2018 •

edited by gojomo

Loading

menshikh-iv left a comment

menshikh-iv Jan 24, 2018

sj29-innovate Jan 24, 2018 •

edited by menshikh-iv

Loading

sj29-innovate Jan 25, 2018

menshikh-iv Jan 26, 2018

sj29-innovate Jan 26, 2018 •

edited by menshikh-iv

Loading

sj29-innovate commented Jan 24, 2018

menshikh-iv commented Jan 24, 2018

sj29-innovate commented Jan 24, 2018 •

edited

Loading

menshikh-iv commented Jan 25, 2018

sj29-innovate commented Jan 25, 2018 •

edited by menshikh-iv

Loading

sj29-innovate commented Jan 29, 2018

menshikh-iv commented Jan 30, 2018 •

edited

Loading

sj29-innovate commented Jan 30, 2018 •

edited by menshikh-iv

Loading

menshikh-iv commented Jan 31, 2018

menshikh-iv left a comment

menshikh-iv Jan 31, 2018

menshikh-iv commented Jan 31, 2018

sj29-innovate commented Jan 31, 2018

menshikh-iv commented Jan 31, 2018

sj29-innovate commented Feb 2, 2018

menshikh-iv commented Feb 2, 2018

menshikh-iv commented Feb 5, 2018

gojomo commented Feb 5, 2018

menshikh-iv commented Feb 9, 2018

gojomo commented Feb 10, 2018

menshikh-iv commented Feb 12, 2018

sj29-innovate commented Feb 14, 2018

menshikh-iv commented Feb 15, 2018

Fix empty output bug in Phrases. Fix #1401 #1853

Fix empty output bug in Phrases. Fix #1401 #1853

Conversation

sj29-innovate commented Jan 23, 2018 • edited by gojomo Loading

menshikh-iv left a comment

Choose a reason for hiding this comment

menshikh-iv Jan 24, 2018

Choose a reason for hiding this comment

sj29-innovate Jan 24, 2018 • edited by menshikh-iv Loading

Choose a reason for hiding this comment

sj29-innovate Jan 25, 2018

Choose a reason for hiding this comment

menshikh-iv Jan 26, 2018

Choose a reason for hiding this comment

sj29-innovate Jan 26, 2018 • edited by menshikh-iv Loading

Choose a reason for hiding this comment

sj29-innovate commented Jan 24, 2018

menshikh-iv commented Jan 24, 2018

sj29-innovate commented Jan 24, 2018 • edited Loading

menshikh-iv commented Jan 25, 2018

sj29-innovate commented Jan 25, 2018 • edited by menshikh-iv Loading

sj29-innovate commented Jan 29, 2018

menshikh-iv commented Jan 30, 2018 • edited Loading

sj29-innovate commented Jan 30, 2018 • edited by menshikh-iv Loading

menshikh-iv commented Jan 31, 2018

menshikh-iv left a comment

Choose a reason for hiding this comment

menshikh-iv Jan 31, 2018

Choose a reason for hiding this comment

menshikh-iv commented Jan 31, 2018

sj29-innovate commented Jan 31, 2018

menshikh-iv commented Jan 31, 2018

sj29-innovate commented Feb 2, 2018

menshikh-iv commented Feb 2, 2018

menshikh-iv commented Feb 5, 2018

gojomo commented Feb 5, 2018

menshikh-iv commented Feb 9, 2018

gojomo commented Feb 10, 2018

menshikh-iv commented Feb 12, 2018

sj29-innovate commented Feb 14, 2018

menshikh-iv commented Feb 15, 2018

Fix empty output bug in `Phrases`. Fix #1401 #1853

Fix empty output bug in `Phrases`. Fix #1401 #1853

sj29-innovate commented Jan 23, 2018 •

edited by gojomo

Loading

sj29-innovate Jan 24, 2018 •

edited by menshikh-iv

Loading

sj29-innovate Jan 26, 2018 •

edited by menshikh-iv

Loading

sj29-innovate commented Jan 24, 2018 •

edited

Loading

sj29-innovate commented Jan 25, 2018 •

edited by menshikh-iv

Loading

menshikh-iv commented Jan 30, 2018 •

edited

Loading

sj29-innovate commented Jan 30, 2018 •

edited by menshikh-iv

Loading