-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Correctly process empty documents in AuthorTopicModel
#2133
Conversation
@piskvorky is there anything else I need to do for this pull request? |
gensim/models/atmodel.py
Outdated
cts = np.array([cnt for _, cnt in doc]) | ||
ids = [id for id, _ in doc] | ||
ids = np.array(ids, dtype=np.integer) | ||
cts = np.array([cnt for _, cnt in doc], dtype=np.integer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with this np.integer
type. How does it differ from normal np.int
? What's the difference, why use one or the other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No difference in our case
import numpy as np
arr1, arr2 = [1, 2, 3], []
assert np.array(arr1, dtype=np.int).dtype == \
np.array(arr1, dtype=np.integer).dtype == \
np.array(arr2, dtype=np.int).dtype == \
np.array(arr2, dtype=np.integer).dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all of it "casted" to int64
on my x64 linux
It looks good, thanks @probinso . Just a little clarification around Then we wait for @menshikh-iv to get back from holiday, review & merge :) |
That is a good question. I'll read through the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @probinso, please fix current review and I'll merge your PR
gensim/models/atmodel.py
Outdated
@@ -460,11 +460,12 @@ def inference(self, chunk, author2doc, doc2author, rhot, collect_sstats=False, c | |||
# make sure the term IDs are ints, otherwise np will get upset | |||
ids = [int(idx) for idx, _ in doc] | |||
else: | |||
ids = [idx for idx, _ in doc] | |||
cts = np.array([cnt for _, cnt in doc]) | |||
ids = [id for id, _ in doc] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please revert back idx
(id
is built-in function name)
gensim/test/test_atmodel.py
Outdated
@@ -110,6 +109,19 @@ def testBasic(self): | |||
jill_topics = matutils.sparse2full(jill_topics, model.num_topics) | |||
self.assertTrue(all(jill_topics > 0)) | |||
|
|||
def testEmptyDocument(self): | |||
_local_texts = common_texts + [['only_occurs_once_in_corpus_and_alone_in_doc']] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why vars starts from underscore? please remove underscores from start
gensim/test/test_atmodel.py
Outdated
_corpus = [_dictionary.doc2bow(text) for text in _local_texts] | ||
_a2d = author2doc.copy() | ||
_a2d['joaquin'] = [len(_local_texts) - 1] | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need try/except section in test, if test raise unexpected exception - this means that test broken
gensim/test/test_atmodel.py
Outdated
try: | ||
_ = self.class_(_corpus, author2doc=_a2d, id2word=_dictionary, num_topics=2) | ||
except IndexError: | ||
raise IndexError("error occurs in 1.0.0 release tag") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
???
gensim/test/test_atmodel.py
Outdated
a2d['joaquin'] = [len(local_texts) - 1] | ||
|
||
_ = self.class_(corpus, author2doc=a2d, id2word=dictionary, num_topics=2) | ||
assert(_) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to retrieve vector for any document or corpus (instead of assertion) as "sanity check" action, because _
will be always initialized.
AuthorTopicModel
Thanks @probinso, congratz with the first contribution 🥇 ! |
I'm still -1 on using Unless this change is well-understood, it sounds like a recipe for type-casting and serialization trouble. |
@piskvorky fixed in #2145 |
This is a fix #1589
initialized empty numpy arrays defualt to
dtype=np.float
making them ineligible for use as index arrays (which must be ofdtype=np.integer
ordtype=np.bool
)