Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dtype argument for chunkize_serial, fix LdaModel #2027

Merged
merged 5 commits into from
May 1, 2018

Conversation

darindf
Copy link
Contributor

@darindf darindf commented Apr 11, 2018

Reduce the size of data being sent to distributed workers, when as_numpy is true, is defaulting to float64

@@ -698,7 +698,8 @@ def rho():
dirty = False

reallen = 0
for chunk_no, chunk in enumerate(utils.grouper(corpus, chunksize, as_numpy=chunks_as_numpy)):
for chunk_no, chunk in enumerate(utils.grouper(corpus, chunksize, as_numpy=chunks_as_numpy,
dtype=self.dtype)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use hanging indents instead of vertical + fix PE8 here

gensim/utils.py Outdated
@@ -1119,7 +1119,7 @@ def substitute_entity(match):
return RE_HTML_ENTITY.sub(substitute_entity, text)


def chunkize_serial(iterable, chunksize, as_numpy=False):
def chunkize_serial(iterable, chunksize, as_numpy=False,dtype=np.float32):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you sure about default dtype? this will works incorrect if it contains only int values + default value for float (in numpy) is f64 (not f32)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with the new code, this is the dtype of the numpy arrays

wrapped_chunk[0][0].dtype
dtype('float32')

with the original code, using numpy dtype default, this is

wrapped_chunk[0][0].dtype
dtype('float64')

The iterator itself is returning documents that are list of tuples of type (int,float) (tokenId,tokenFreq) from corpora.MmCorpus

gensim/utils.py Outdated
@@ -1148,7 +1148,7 @@ def chunkize_serial(iterable, chunksize, as_numpy=False):
if as_numpy:
# convert each document to a 2d numpy array (~6x faster when transmitting
# chunk data over the wire, in Pyro)
wrapped_chunk = [[np.array(doc) for doc in itertools.islice(it, int(chunksize))]]
wrapped_chunk = [[np.asarray(doc,dtype=dtype) for doc in itertools.islice(it, int(chunksize))]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

,dtype -> , dtype

gensim/utils.py Outdated
@@ -1148,7 +1148,7 @@ def chunkize_serial(iterable, chunksize, as_numpy=False):
if as_numpy:
# convert each document to a 2d numpy array (~6x faster when transmitting
# chunk data over the wire, in Pyro)
wrapped_chunk = [[np.array(doc) for doc in itertools.islice(it, int(chunksize))]]
wrapped_chunk = [[np.asarray(doc, dtype=dtype) for doc in itertools.islice(it, int(chunksize))]]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this change the existing behaviour? I'm -1 on that, we should keep compatibility. Both in terms of the array copy and dtype.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is being used as part of LdaState within LdaModel. These function/classes are already parameterized with dtype, so this makes this more consistent with that code base.

Copy link
Owner

@piskvorky piskvorky Apr 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, being parametrized is fine. It's a good addition.

But changing the existing semantics (potentially not copying arrays, changing the default float precision) is not fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Term frequency should always be an integer value, as you can't have half a word in a document. Not sure why mmcorpus is returning documents as (int,float) while bow returns (int,int)

In both cases this is creating a numpy array, for use in serializing in a distributed environment, which means a copy of the array is being received on the lda_worker process.

Reading asarray documentation, copy=False is being ignored as there is datatype change being made between the source and destination array.

If it makes it clearer, I can change this to

np.array(doc, dtype=dtype)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@darindf yes, this will be better, I agree

@@ -698,7 +698,9 @@ def rho():
dirty = False

reallen = 0
for chunk_no, chunk in enumerate(utils.grouper(corpus, chunksize, as_numpy=chunks_as_numpy)):
for chunk_no, chunk in enumerate(utils.grouper(
corpus, chunksize, as_numpy=chunks_as_numpy,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bad indent. Also, line too long -- best split this long expression into two smaller expressions (first assign utils.grouper, then enumerate).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still not changed

@@ -8,9 +8,3 @@
:undoc-members:
:show-inheritance:


.. autoclass:: IndexedCorpus
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge current develop to your PR please (this part will go away, because this is already merged)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I merged the changes, and commited back, but I can't get this message to go away.

@@ -698,7 +698,9 @@ def rho():
dirty = False

reallen = 0
for chunk_no, chunk in enumerate(utils.grouper(corpus, chunksize, as_numpy=chunks_as_numpy)):
for chunk_no, chunk in enumerate(utils.grouper(
corpus, chunksize, as_numpy=chunks_as_numpy,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still not changed

gensim/utils.py Outdated
@@ -1148,7 +1148,7 @@ def chunkize_serial(iterable, chunksize, as_numpy=False):
if as_numpy:
# convert each document to a 2d numpy array (~6x faster when transmitting
# chunk data over the wire, in Pyro)
wrapped_chunk = [[np.array(doc) for doc in itertools.islice(it, int(chunksize))]]
wrapped_chunk = [[np.asarray(doc, dtype=dtype) for doc in itertools.islice(it, int(chunksize))]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@darindf yes, this will be better, I agree

@darindf darindf force-pushed the develop branch 6 times, most recently from fb9b1bb to a924c7a Compare April 23, 2018 19:41
@menshikh-iv
Copy link
Contributor

@darindf please fix PEP8 https://travis-ci.org/RaRe-Technologies/gensim/jobs/370273539#L511 and after this PR looks ready to merge.

@menshikh-iv
Copy link
Contributor

Thanks @darindf, good work 👍

@menshikh-iv menshikh-iv changed the title Changed from using floats to ints for doc terms & frequencies Add dtype argument for chunkize_serial, fix LdaModel May 1, 2018
@menshikh-iv menshikh-iv merged commit 8b81091 into piskvorky:develop May 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants