Correctly process empty documents in `AuthorTopicModel` #2133

probinso · 2018-07-18T18:04:17Z

This is a fix for #1589.

Empty numpy arrays created without an explicit dtype default to dtype=np.float, which makes them ineligible for use as index arrays (index arrays must be of dtype=np.integer or dtype=np.bool).
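The behaviour described above can be reproduced in a few lines. This is only a minimal sketch: `weights` is a made-up array rather than anything from `atmodel.py`, and the built-in `int` is used as the dtype here purely for illustration.

```python
import numpy as np

# Hypothetical data standing in for any array that gets indexed.
weights = np.array([0.1, 0.2, 0.3])

# An empty list carries no type information, so NumPy falls back to float64.
empty_default = np.array([])
print(empty_default.dtype)  # float64

# A float array is rejected as an index, even when it has no elements.
try:
    weights[empty_default]
    failed = False
except IndexError:
    failed = True

# Declaring an integer dtype keeps the empty array usable as an index.
empty_ids = np.array([], dtype=int)
selected = weights[empty_ids]
print(failed, selected.shape)  # True (0,)
```

A non-empty list of ints never hits this problem because NumPy infers an integer dtype from its elements, which is why only empty documents triggered the error.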

probinso · 2018-07-25T16:51:11Z

@piskvorky is there anything else I need to do for this pull request?

piskvorky · 2018-07-26T18:12:22Z

gensim/models/atmodel.py

-            cts = np.array([cnt for _, cnt in doc])
+                ids = [id for id, _ in doc]
+            ids = np.array(ids, dtype=np.integer)
+            cts = np.array([cnt for _, cnt in doc], dtype=np.integer)


I'm not familiar with this np.integer type. How does it differ from normal np.int? What's the difference, why use one or the other?

No difference in our case

import numpy as np

arr1, arr2 = [1, 2, 3], []

assert np.array(arr1, dtype=np.int).dtype == \
    np.array(arr1, dtype=np.integer).dtype == \
    np.array(arr2, dtype=np.int).dtype == \
    np.array(arr2, dtype=np.integer).dtype

All of them are cast to int64 on my x64 Linux machine.

piskvorky · 2018-07-26T18:13:50Z

It looks good, thanks @probinso. Just a little clarification around np.integer / np.int for my sake, please.

Then we wait for @menshikh-iv to get back from holiday, review & merge :)

probinso · 2018-07-30T01:11:54Z

@piskvorky

That is a good question. I'll read through the numpy code. I used what I expected to be the most general correct type. However, I can tell that they are different because (np.int is np.integer) == False.
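For context on why the two names are different objects: in NumPy releases of that period, `np.int` was an alias for the built-in `int`, whereas `np.integer` is the abstract parent class of the concrete integer scalar types. The sketch below illustrates that relationship. It uses the built-in `int` instead of `np.int`, since the alias has since been removed from NumPy, and it avoids passing `np.integer` as a dtype; the variable names are illustrative only.

```python
import numpy as np

# np.integer sits above the concrete integer scalar types in NumPy's
# class hierarchy; it is not itself a concrete storage type.
print(issubclass(np.int64, np.integer))  # True
print(issubclass(np.int32, np.integer))  # True

# It is a different object from the built-in int.
print(np.integer is int)  # False

# Using the built-in int as dtype yields one concrete platform integer
# type, whether or not the input list is empty.
empty_ids = np.array([], dtype=int)
full_ids = np.array([1, 2, 3], dtype=int)
same_dtype = empty_ids.dtype == full_ids.dtype
is_integer = np.issubdtype(empty_ids.dtype, np.integer)
print(same_dtype, is_integer)  # True True
```

Either spelling gives an array that is valid as an index; the practical difference is that a concrete type makes the stored dtype explicit, which is the concern raised about type-casting and serialization.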

menshikh-iv

Thanks @probinso, please address the current review comments and I'll merge your PR.

menshikh-iv · 2018-07-31T08:42:27Z

gensim/models/atmodel.py

@@ -460,11 +460,12 @@ def inference(self, chunk, author2doc, doc2author, rhot, collect_sstats=False, c
                # make sure the term IDs are ints, otherwise np will get upset
                ids = [int(idx) for idx, _ in doc]
            else:
-                ids = [idx for idx, _ in doc]
-            cts = np.array([cnt for _, cnt in doc])
+                ids = [id for id, _ in doc]


Please revert to idx (id is the name of a built-in function).

menshikh-iv · 2018-07-31T08:45:59Z

gensim/test/test_atmodel.py

@@ -110,6 +109,19 @@ def testBasic(self):
        jill_topics = matutils.sparse2full(jill_topics, model.num_topics)
        self.assertTrue(all(jill_topics > 0))

+    def testEmptyDocument(self):
+        _local_texts = common_texts + [['only_occurs_once_in_corpus_and_alone_in_doc']]


Why do the variable names start with an underscore? Please remove the leading underscores.

menshikh-iv · 2018-07-31T08:48:56Z

gensim/test/test_atmodel.py

+        _corpus = [_dictionary.doc2bow(text) for text in _local_texts]
+        _a2d = author2doc.copy()
+        _a2d['joaquin'] = [len(_local_texts) - 1]
+        try:


There is no need for a try/except block in the test: if the test raises an unexpected exception, that already means the test is broken.
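The point about try/except can be illustrated with the standard library alone. In this sketch, `build_model` is a hypothetical stand-in for a constructor that fails on an empty document; it is not gensim code. The test case is created inside a helper function only so that the sketch can run it and inspect the result.

```python
import io
import unittest


def build_model(corpus):
    # Hypothetical stand-in: fails on an empty document.
    if any(len(doc) == 0 for doc in corpus):
        raise IndexError("arrays used as indices must be of integer (or boolean) type")
    return object()


def make_test_case():
    class EmptyDocumentCheck(unittest.TestCase):
        def test_empty_document(self):
            # No try/except: the runner records any unexpected exception.
            build_model([[(0, 1)], []])

    return EmptyDocumentCheck


suite = unittest.defaultTestLoader.loadTestsFromTestCase(make_test_case())
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print(len(result.errors), result.wasSuccessful())  # 1 False
```

The runner reports the `IndexError` with its full traceback, so catching and re-raising it inside the test adds nothing.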

menshikh-iv · 2018-07-31T08:49:29Z

gensim/test/test_atmodel.py

+        try:
+            _ = self.class_(_corpus, author2doc=_a2d, id2word=_dictionary, num_topics=2)
+        except IndexError:
+            raise IndexError("error occurs in 1.0.0 release tag")


menshikh-iv · 2018-08-02T03:22:42Z

gensim/test/test_atmodel.py

+        a2d['joaquin'] = [len(local_texts) - 1]
+
+        _ = self.class_(corpus, author2doc=a2d, id2word=dictionary, num_topics=2)
+        assert(_)


It would be better to retrieve a vector for some document or corpus as a "sanity check" instead of this assertion, because _ will always be initialized.

menshikh-iv · 2018-08-02T04:21:42Z

Thanks @probinso, congrats on your first contribution 🥇!

piskvorky · 2018-08-02T20:42:40Z

I'm still -1 on using np.integer -- what is that and why should we use it, instead of the standard int / np.int?

Unless this change is well-understood, it sounds like a recipe for type-casting and serialization trouble.

menshikh-iv · 2018-08-03T03:09:41Z

@piskvorky fixed in #2145

probinso added 2 commits July 20, 2018 14:36

test for piskvorky#1589

accc625

bugfix piskvorky#1589

e3e47ef

probinso force-pushed the fix_1589 branch from ef5fe5d to e3e47ef Compare July 20, 2018 21:38

Merge branch 'develop' into fix_1589

7b7633d

probinso force-pushed the fix_1589 branch from 5334e95 to 0eb92bc Compare July 20, 2018 23:29

ignore unused assigned varaible

db74531

probinso force-pushed the fix_1589 branch from 0eb92bc to db74531 Compare July 20, 2018 23:29

piskvorky changed the title ~~Fix 1589~~ [WIP] Correctly process empty documents in AuthorTopicModel Jul 21, 2018

piskvorky reviewed Jul 26, 2018

menshikh-iv suggested changes Jul 31, 2018

PR review

8aa04b2

menshikh-iv reviewed Aug 2, 2018

Update test_atmodel.py

ddf8dec

menshikh-iv changed the title ~~[WIP] Correctly process empty documents in AuthorTopicModel~~ Correctly process empty documents in AuthorTopicModel Aug 2, 2018

menshikh-iv merged commit 61728a0 into piskvorky:develop Aug 2, 2018

probinso deleted the fix_1589 branch August 2, 2018 04:32

menshikh-iv mentioned this pull request Aug 3, 2018

Replace np.integer -> np.int in AuthorTopicModel #2145

Merged
