Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix empty output bug in Phrases. Fix #1401 #1853

Merged
merged 8 commits into from
Feb 15, 2018

Conversation

sj29-innovate
Copy link
Contributor

@sj29-innovate sj29-innovate commented Jan 23, 2018

The "is_single" function previously returned iterator in case the object was a corpus and was followed by self._apply() which again made an iterator , it works fine now with this fix.

(fixes #1401)

@sj29-innovate sj29-innovate changed the title Fixes #1401 , Phrases behaviousr now consistent on different versions , tests added for empty iterator Fixes #1401 , Phrases behaviour now consistent on different versions , tests added for empty iterator Jan 23, 2018
Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also want to see a concrete example that doesn't work correctly earlier, but now works fine (because we can't reproduce it properly in unittest).

@@ -115,7 +115,7 @@ def _is_single(obj):
return True, obj_iter
else:
# If the first item isn't a string, assume obj is a corpus
return False, obj_iter
return False, list(obj_iter)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

obj_iter can be very large -> we can receive out-of-memory here.

Copy link
Contributor Author

@sj29-innovate sj29-innovate Jan 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also works if I return just obj instead of list(obj_iter) but it fails if the obj is an iterator such as iter([ [] , [] ]) , so I am thinking of adding a special case to handle such iterators because some of the unit tests were failing and else otherwise just return the obj.
Proposed change ,

    obj_iter = iter(obj)
    temp_iter = obj_iter
    try:
        peek = next(obj_iter)
        obj_iter = it.chain([peek], obj_iter)
    except StopIteration:
        # An empty object is a single document 
        return True, obj
    if isinstance(peek, string_types) :
        # It's a document, return the iterator
        return True, obj_iter
    if temp_iter == obj :
        #Checking for an iterator to the object
    	return False , list(obj_iter)
    else :
        # If the first item isn't a string, assume obj is a corpus
        return False, obj_iter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@menshikh-iv What do you think about the above proposed changes ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sj29-innovate anyway, list(...) unacceptable here.

Copy link
Contributor Author

@sj29-innovate sj29-innovate Jan 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@menshikh-iv Yes , I agree that is not a very good way to do that . This works for me without using list(...),

    obj_iter = iter(obj)
    temp_iter = obj_iter
    try:
        peek = next(obj_iter)
        obj_iter = it.chain([peek], obj_iter)
    except StopIteration:
        # An empty object is a single document
        return True, obj
    if isinstance(peek, string_types):
        # It's a document, return the iterator
        return True, obj_iter
    if temp_iter == obj :
        #Checking for iterator to the object    
    	return False , obj_iter
    else:
        # If the first item isn't a string, assume obj is a corpus
        return False, obj

I realised there was no use of using the list(...), works well with the above change.

@sj29-innovate
Copy link
Contributor Author

Here are some of the results i am getting with my proposed changes . I am using eng_news_2015_30K.tar.gz corpus and the script provided here https://github.com/jeremybmerrill/ineffable_abracadabra_of_cleaning/blob/master/train_in_vain.py for my testing.

Comparing the output of " print(model.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])) " before and after the changes on latest version ->

Without Changes ->

[(u'berkley', 0.8121861219406128), (u'two_thirds', 0.8113135099411011), (u'quizzed', 0.8031861782073975), (u'kade', 0.7986487150192261), (u'ssi', 0.7958680391311646), (u'reversing', 0.7934471368789673), (u'home_run', 0.7916820645332336), (u'navajo', 0.7898678779602051), (u'implementation_plan', 0.7837337255477905), (u'send', 0.7834026217460632)]
train() was called with empty iterator

With the changes ->

[(u'first_lady', 0.93344646692276), (u'mrs', 0.9223362803459167), (u'ms', 0.9138304591178894), (u'jane', 0.9107893109321594), (u'kate', 0.9079267978668213), (u'catherine', 0.9071255922317505), (u'anne', 0.9063032269477844), (u'queen', 0.9023390412330627), (u'maria', 0.9020583033561707), (u'elizabeth', 0.8988279104232788)]

@menshikh-iv
Copy link
Contributor

@sj29-innovate can you make more "focused" example (without word2vec), only with phrases?

@sj29-innovate
Copy link
Contributor Author

sj29-innovate commented Jan 24, 2018

@menshikh-iv
Please have a look at this observation , "is_single" function returns an object of type "itertools.chain" as per the current version which is a "generator" and therefore gets exhausted after one iteration and gets emptied for the further computation. Therefore any use case , which involves more than one iteration of such an object will fail , that is why word2vec training also failed because it involves multiple iterations of it.

I encountered this when i ran two loops over the same object returned by "is_single" one after the another and the second loop didnt work.

My test case was as follows ,

test_sents = [
  [ 'I','did','not'],
  ['I','can','not']
]

corpus = trigrams_model[bigrams_model[test_sents]]
for sent in corpus :
  print sent
for sent in corpus :
  print sent

Output ->

[u'I', u'did_not']
[u'I', u'can_not']

Expected Output->

[u'I', u'did_not']
[u'I', u'can_not']
[u'I', u'did_not']
[u'I', u'can_not']

@menshikh-iv
Copy link
Contributor

@sj29-innovate yes, almost what I wanted to see, please make it executable (all imports, fits Phrases, dataset), I want to copy-paste it and run with different gensim version.

@sj29-innovate
Copy link
Contributor Author

sj29-innovate commented Jan 25, 2018

@menshikh-iv Here is the executable code for the above testcase. Added brown corpus for training for a better phrases behaviour.

from gensim.models.phrases import Phrases,Phraser
from nltk.corpus import brown

test_sents = [
  [ 'I','did','not'],
  ['I','can','not']
]

bigrams_phrases = Phrases(brown.sents())
bigrams_model = Phraser(bigrams_phrases)
trigrams_phrases = Phrases(bigrams_model[brown.sents()])
trigrams_model = Phraser(trigrams_phrases)
model = trigrams_model[bigrams_model[test_sents]]

print "Printing in First Loop\n"
for sent in model :
  print sent

print "\nPrinting in Second Loop\n"
for sent in model :
  print sent

@sj29-innovate
Copy link
Contributor Author

@menshikh-iv I believe we must look at generator type check in word2vec which failed to recognize "itertools.chain" and therefore ended up training on empty corpus. If the changes to "is_single" function seems fine to you , I can commit them.

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Jan 30, 2018

@sj29-innovate can you look into this function https://github.com/RaRe-Technologies/gensim/blob/255ce25903c2521ad6c92c00875a7dc3aa88fe7d/gensim/utils.py#L797 looks very similar to the current case.

@sj29-innovate
Copy link
Contributor Author

sj29-innovate commented Jan 30, 2018

@menshikh-iv The problem I figured out was that it was not possible to do multiple iterations of the object we got from the Phraser output as it was a "generator" object , Please look at the following snippet

PhraseObject = trigrams_model[bigrams_model[sentences]]
print type(PhraseObject)
print type(PhraseObject.corpus)

Output

<class 'gensim.interfaces.TransformedCorpus'>
<type 'itertools.chain'>

__iter__ function for class TransformedCorpus ->

def __iter__(self):
        if self.chunksize:
            for chunk in utils.grouper(self.corpus, self.chunksize):
                for transformed in self.obj.__getitem__(chunk, chunksize=None):
                    yield transformed
        else:
            for doc in self.corpus:
                yield self.obj[doc]

So basically the self.corpus is an itertools.chain object which is getting emptied when we iterate through it.

@menshikh-iv
Copy link
Contributor

@sj29-innovate I understand what you mean, can you push your last changes?

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like almost done, please make last changes.

bigram_phraser = Phraser(bigram_phrases)
trigram_phrases = Phrases(bigram_phraser[self.sentences])
trigram_phraser = Phraser(trigram_phrases)
self.assertNotEqual(trigram_phraser[bigram_phraser[self.sentences]].__len__(), 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's better to make it more similar with example in PR

trigrams = trigram_phraser[bigram_phraser[self.sentences]]
fst, snd = list(trigrams), list(trigrams)
self.assertEqual(fst, snd)
self.assertNotEqual(snd, [])

@menshikh-iv
Copy link
Contributor

@sj29-innovate good work: +1:
@gojomo can you look into (to be sure that I didn't miss anything)?

@sj29-innovate
Copy link
Contributor Author

@menshikh-iv Thanks , can you suggest some further issues once it's reviewed by Gordon.

@menshikh-iv
Copy link
Contributor

@sj29-innovate of course!

@menshikh-iv menshikh-iv changed the title Fixes #1401 , Phrases behaviour now consistent on different versions , tests added for empty iterator Fix bug in Phrases. Fix #1401 Feb 1, 2018
@sj29-innovate
Copy link
Contributor Author

@menshikh-iv Can you already suggest me some other issue I can start working on. Thanks.

@menshikh-iv
Copy link
Contributor

@sj29-innovate try to fix 1869 (fresh-investigated bug in MmCorpus)

@menshikh-iv
Copy link
Contributor

@gojomo ping

@gojomo
Copy link
Collaborator

gojomo commented Feb 5, 2018

I think the preexisting API - using []-indexing to do something different to a single-case or an iterable, and the _is_single() peek-testing – is inherently confusing & error-prone and should be eliminated entirely.

With regard to a minimal fix for the specific issue in #1401, if the test case works in a suitably-old gensim (0.13.4.1), then breaks in a newer version (1.0.0 through 2.1.0) – matching the #1401 report – then works again after the fix, that's a good sign.

It'd be even better if the exact commit that broke the behavior was identified, and the exact person who made that commit (or understood the cases _is_single() was supposed to handle) reviewed this to see if it was the original intent. I think it's 34759be by @ELind77 & committed by @tmylk.

I'm suspicious this may fail too-subtly in the common mistake-case where someone provides an iterator object, and not a restartable iterable object - it'll fail to restore the peeked object, and it will only be usable for a single iteration (possibly resulting in other subtle issues down-the-line, if other training expects re-iteration).

@menshikh-iv
Copy link
Contributor

@gojomo so, you propose to completely remove is_single here ?

@gojomo
Copy link
Collaborator

gojomo commented Feb 10, 2018

I’d discard the practice of using []-indexing as the interface to stream/generator-phrasification – a compatibility-breaking change. Then there’d be no reason for confusing _is_single()-like behavior-switching.

@menshikh-iv
Copy link
Contributor

@gojomo, unfortunately, this is almost impossible now (too breaking change now, also we broke very stable API).

@sj29-innovate
Copy link
Contributor Author

@menshikh-iv What are the further requirements here ?

@menshikh-iv menshikh-iv changed the title Fix bug in Phrases. Fix #1401 Fix empty output bug in Phrases. Fix #1401 Feb 15, 2018
@menshikh-iv
Copy link
Contributor

@sj29-innovate for the current state - that's all that we can do here (without any breaking changes), thank you @sj29-innovate!

@menshikh-iv menshikh-iv merged commit 2722744 into piskvorky:develop Feb 15, 2018
@sj29-innovate sj29-innovate deleted the new-develop branch February 15, 2018 12:20
sj29-innovate added a commit to sj29-innovate/gensim that referenced this pull request Feb 21, 2018
* bm25 scoring function updated

* Fixes piskvorky#1401 , Phrases behavious now consistent on different versions , test added for empty iterator

* Fixes piskvorky#1401 , IS_SINGLE Function updated

* Fixes piskvorky#1401 , IS_SINGLE function updated

* Fixes piskvorky#1401 , IS_SINGLE function updated

* Fixes piskvorky#1401 , tests for phrasified sentences added
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

different Phrases behavior for same code/corpus between 0.13.4.1 and 2.1.0
3 participants