-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add dtype
argument for chunkize_serial
, fix LdaModel
#2027
Conversation
gensim/models/ldamodel.py
Outdated
@@ -698,7 +698,8 @@ def rho(): | |||
dirty = False | |||
|
|||
reallen = 0 | |||
for chunk_no, chunk in enumerate(utils.grouper(corpus, chunksize, as_numpy=chunks_as_numpy)): | |||
for chunk_no, chunk in enumerate(utils.grouper(corpus, chunksize, as_numpy=chunks_as_numpy, | |||
dtype=self.dtype)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use hanging indents instead of vertical + fix PE8 here
gensim/utils.py
Outdated
@@ -1119,7 +1119,7 @@ def substitute_entity(match): | |||
return RE_HTML_ENTITY.sub(substitute_entity, text) | |||
|
|||
|
|||
def chunkize_serial(iterable, chunksize, as_numpy=False): | |||
def chunkize_serial(iterable, chunksize, as_numpy=False,dtype=np.float32): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you sure about default dtype? this will works incorrect if it
contains only int
values + default value for float (in numpy) is f64
(not f32
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with the new code, this is the dtype of the numpy arrays
wrapped_chunk[0][0].dtype
dtype('float32')
with the original code, using numpy dtype default, this is
wrapped_chunk[0][0].dtype
dtype('float64')
The iterator itself is returning documents that are list of tuples of type (int,float) (tokenId,tokenFreq) from corpora.MmCorpus
gensim/utils.py
Outdated
@@ -1148,7 +1148,7 @@ def chunkize_serial(iterable, chunksize, as_numpy=False): | |||
if as_numpy: | |||
# convert each document to a 2d numpy array (~6x faster when transmitting | |||
# chunk data over the wire, in Pyro) | |||
wrapped_chunk = [[np.array(doc) for doc in itertools.islice(it, int(chunksize))]] | |||
wrapped_chunk = [[np.asarray(doc,dtype=dtype) for doc in itertools.islice(it, int(chunksize))]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
,dtype
-> , dtype
gensim/utils.py
Outdated
@@ -1148,7 +1148,7 @@ def chunkize_serial(iterable, chunksize, as_numpy=False): | |||
if as_numpy: | |||
# convert each document to a 2d numpy array (~6x faster when transmitting | |||
# chunk data over the wire, in Pyro) | |||
wrapped_chunk = [[np.array(doc) for doc in itertools.islice(it, int(chunksize))]] | |||
wrapped_chunk = [[np.asarray(doc, dtype=dtype) for doc in itertools.islice(it, int(chunksize))]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this change the existing behaviour? I'm -1 on that, we should keep compatibility. Both in terms of the array copy
and dtype
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is being used as part of LdaState within LdaModel. These function/classes are already parameterized with dtype, so this makes this more consistent with that code base.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, being parametrized is fine. It's a good addition.
But changing the existing semantics (potentially not copying arrays, changing the default float precision) is not fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Term frequency should always be an integer value, as you can't have half a word in a document. Not sure why mmcorpus is returning documents as (int,float) while bow returns (int,int)
In both cases this is creating a numpy array, for use in serializing in a distributed environment, which means a copy of the array is being received on the lda_worker process.
Reading asarray documentation, copy=False
is being ignored as there is datatype change being made between the source and destination array.
If it makes it clearer, I can change this to
np.array(doc, dtype=dtype)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@darindf yes, this will be better, I agree
gensim/models/ldamodel.py
Outdated
@@ -698,7 +698,9 @@ def rho(): | |||
dirty = False | |||
|
|||
reallen = 0 | |||
for chunk_no, chunk in enumerate(utils.grouper(corpus, chunksize, as_numpy=chunks_as_numpy)): | |||
for chunk_no, chunk in enumerate(utils.grouper( | |||
corpus, chunksize, as_numpy=chunks_as_numpy, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bad indent. Also, line too long -- best split this long expression into two smaller expressions (first assign utils.grouper
, then enumerate
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still not changed
docs/src/corpora/indexedcorpus.rst
Outdated
@@ -8,9 +8,3 @@ | |||
:undoc-members: | |||
:show-inheritance: | |||
|
|||
|
|||
.. autoclass:: IndexedCorpus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merge current develop
to your PR please (this part will go away, because this is already merged)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I merged the changes, and commited back, but I can't get this message to go away.
gensim/models/ldamodel.py
Outdated
@@ -698,7 +698,9 @@ def rho(): | |||
dirty = False | |||
|
|||
reallen = 0 | |||
for chunk_no, chunk in enumerate(utils.grouper(corpus, chunksize, as_numpy=chunks_as_numpy)): | |||
for chunk_no, chunk in enumerate(utils.grouper( | |||
corpus, chunksize, as_numpy=chunks_as_numpy, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still not changed
gensim/utils.py
Outdated
@@ -1148,7 +1148,7 @@ def chunkize_serial(iterable, chunksize, as_numpy=False): | |||
if as_numpy: | |||
# convert each document to a 2d numpy array (~6x faster when transmitting | |||
# chunk data over the wire, in Pyro) | |||
wrapped_chunk = [[np.array(doc) for doc in itertools.islice(it, int(chunksize))]] | |||
wrapped_chunk = [[np.asarray(doc, dtype=dtype) for doc in itertools.islice(it, int(chunksize))]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@darindf yes, this will be better, I agree
fb9b1bb
to
a924c7a
Compare
This reverts commit cb4900f.
@darindf please fix PEP8 https://travis-ci.org/RaRe-Technologies/gensim/jobs/370273539#L511 and after this PR looks ready to merge. |
Thanks @darindf, good work 👍 |
dtype
argument for chunkize_serial
, fix LdaModel
Reduce the size of data being sent to distributed workers, when as_numpy is true, is defaulting to float64