Add `dtype` argument for `chunkize_serial`, fix `LdaModel` #2027

darindf · 2018-04-11T18:31:50Z

Reduce the size of the data sent to distributed workers. When as_numpy is true, the arrays currently default to float64.

menshikh-iv · 2018-04-12T04:15:18Z

gensim/models/ldamodel.py

@@ -698,7 +698,8 @@ def rho():
            dirty = False

            reallen = 0
-            for chunk_no, chunk in enumerate(utils.grouper(corpus, chunksize, as_numpy=chunks_as_numpy)):
+            for chunk_no, chunk in enumerate(utils.grouper(corpus, chunksize, as_numpy=chunks_as_numpy,
+                                            dtype=self.dtype)):


Please use hanging indents instead of vertical alignment, and fix PEP8 here.

menshikh-iv · 2018-04-12T04:18:14Z

gensim/utils.py

@@ -1119,7 +1119,7 @@ def substitute_entity(match):
    return RE_HTML_ENTITY.sub(substitute_entity, text)


-def chunkize_serial(iterable, chunksize, as_numpy=False):
+def chunkize_serial(iterable, chunksize, as_numpy=False,dtype=np.float32):


Are you sure about the default dtype? This will work incorrectly if the data contains only int values. Also, the default float dtype in numpy is f64, not f32.

with the new code, this is the dtype of the numpy arrays

wrapped_chunk[0][0].dtype dtype('float32')

with the original code, using numpy dtype default, this is

wrapped_chunk[0][0].dtype dtype('float64')

The iterator itself returns documents as lists of (int, float) tuples, i.e. (tokenId, tokenFreq), from corpora.MmCorpus.
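To make the dtype question above concrete, here is a minimal sketch of what NumPy infers for such documents and how many bytes each choice takes. The sample documents are hypothetical and only mimic the (tokenId, tokenFreq) shape described in the comment.

```python
import numpy as np

# Hypothetical bag-of-words document: (token_id, token_frequency) pairs.
doc = [(0, 1.0), (3, 2.0), (7, 1.0)]

# With no dtype given, NumPy infers float64 from the Python floats.
default_arr = np.array(doc)

# An explicit float32 stores the same 3x2 values in half the bytes.
small_arr = np.array(doc, dtype=np.float32)

print(default_arr.dtype, default_arr.nbytes)
print(small_arr.dtype, small_arr.nbytes)

# An integer-only document is inferred as an integer dtype by default
# (the exact width is platform dependent), so a float default changes its type.
int_doc = [(0, 1), (3, 2)]
int_arr = np.array(int_doc)
forced_float = np.array(int_doc, dtype=np.float32)
```

The halved byte count is the saving the PR is after; the last two lines show the concern raised in review, namely that a float default silently converts integer-only input.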

menshikh-iv · 2018-04-12T04:18:37Z

gensim/utils.py

@@ -1148,7 +1148,7 @@ def chunkize_serial(iterable, chunksize, as_numpy=False):
        if as_numpy:
            # convert each document to a 2d numpy array (~6x faster when transmitting
            # chunk data over the wire, in Pyro)
-            wrapped_chunk = [[np.array(doc) for doc in itertools.islice(it, int(chunksize))]]
+            wrapped_chunk = [[np.asarray(doc,dtype=dtype) for doc in itertools.islice(it, int(chunksize))]]


,dtype -> , dtype
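For readers who want the changed line in context, the following is a simplified, self-contained sketch of a chunking generator with a `dtype` parameter. It is not gensim's actual implementation: the name `chunkize_serial_sketch` is hypothetical, the extra list wrapping from the diff is dropped, and `dtype=None` (NumPy's own inference) is chosen here purely for illustration.

```python
import itertools

import numpy as np


def chunkize_serial_sketch(iterable, chunksize, as_numpy=False, dtype=None):
    """Yield successive chunks of up to `chunksize` documents.

    Hypothetical stand-in, not the gensim function. With as_numpy=True each
    document is converted to a 2D array of the requested dtype.
    """
    it = iter(iterable)
    while True:
        docs = itertools.islice(it, int(chunksize))
        if as_numpy:
            chunk = [np.array(doc, dtype=dtype) for doc in docs]
        else:
            chunk = list(docs)
        if not chunk:
            break
        yield chunk


# Hypothetical corpus of three bag-of-words documents.
docs = [[(0, 1.0), (1, 2.0)], [(2, 1.0)], [(0, 3.0), (4, 1.0)]]
chunks = list(chunkize_serial_sketch(docs, 2, as_numpy=True, dtype=np.float32))
```

With a chunk size of 2, the three documents come back as one chunk of two arrays and one chunk of a single array, all in float32.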

piskvorky · 2018-04-12T18:42:43Z

gensim/utils.py

@@ -1148,7 +1148,7 @@ def chunkize_serial(iterable, chunksize, as_numpy=False):
        if as_numpy:
            # convert each document to a 2d numpy array (~6x faster when transmitting
            # chunk data over the wire, in Pyro)
-            wrapped_chunk = [[np.array(doc) for doc in itertools.islice(it, int(chunksize))]]
+            wrapped_chunk = [[np.asarray(doc, dtype=dtype) for doc in itertools.islice(it, int(chunksize))]]


Doesn't this change the existing behaviour? I'm -1 on that, we should keep compatibility. Both in terms of the array copy and dtype.

This is being used as part of LdaState within LdaModel. These functions/classes are already parameterized with dtype, so this makes the change more consistent with that code base.

Sure, being parametrized is fine. It's a good addition.

But changing the existing semantics (potentially not copying arrays, changing the default float precision) is not fine.

Term frequency should always be an integer value, as you can't have half a word in a document. Not sure why mmcorpus is returning documents as (int,float) while bow returns (int,int)

In both cases this is creating a numpy array, for use in serializing in a distributed environment, which means a copy of the array is being received on the lda_worker process.

Reading the asarray documentation, copy=False is ignored here because the dtype changes between the source and destination arrays.

If it makes it clearer, I can change this to

np.array(doc, dtype=dtype)

@darindf yes, this will be better, I agree
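The copy semantics discussed in this thread can be checked directly. The sketch below uses a hypothetical input array and only documented NumPy behaviour: `np.asarray` returns its input unchanged when it is already an ndarray of the requested dtype, while `np.array` copies by default.

```python
import numpy as np

existing = np.zeros((2, 2), dtype=np.float32)

# Same dtype: asarray hands back the very same object, no copy is made.
same = np.asarray(existing, dtype=np.float32)

# np.array copies by default, even when the dtype already matches.
copied = np.array(existing, dtype=np.float32)

# Different dtype: asarray has to allocate a new array for the conversion.
converted = np.asarray(existing, dtype=np.float64)

# A plain Python list has no buffer to share, so both calls build a new array.
from_list = np.asarray([(0, 1.0), (3, 2.0)], dtype=np.float32)
```

When the input is a list of tuples, as in the chunking code, the two calls behave the same; the difference only matters if a document is already an ndarray of the target dtype.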

piskvorky · 2018-04-12T18:43:56Z

gensim/models/ldamodel.py

@@ -698,7 +698,9 @@ def rho():
            dirty = False

            reallen = 0
-            for chunk_no, chunk in enumerate(utils.grouper(corpus, chunksize, as_numpy=chunks_as_numpy)):
+            for chunk_no, chunk in enumerate(utils.grouper(
+                                                       corpus, chunksize, as_numpy=chunks_as_numpy,


Bad indent. Also, line too long -- best split this long expression into two smaller expressions (first assign utils.grouper, then enumerate).

still not changed
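The suggestion to split the long expression in two can be sketched as follows. The `grouper` function here is a hypothetical stand-in for `utils.grouper` that takes only the two positional arguments, and the corpus is made-up data.

```python
import itertools


def grouper(iterable, chunksize):
    """Hypothetical stand-in for utils.grouper: yield lists of up to chunksize items."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, int(chunksize)))
        if not chunk:
            return
        yield chunk


corpus = ["doc%d" % i for i in range(5)]
chunksize = 2

# Step 1: build the chunk iterator on its own line.
chunks = grouper(corpus, chunksize)

# Step 2: enumerate it in a short loop header that needs no line continuation.
seen = []
for chunk_no, chunk in enumerate(chunks):
    seen.append((chunk_no, chunk))
```

Keeping the call and the loop header on separate lines avoids both the indentation question and the line-length limit.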

menshikh-iv · 2018-04-21T13:53:39Z

docs/src/corpora/indexedcorpus.rst

@@ -8,9 +8,3 @@
    :undoc-members:
    :show-inheritance:

-
-.. autoclass:: IndexedCorpus


Merge current develop to your PR please (this part will go away, because this is already merged)

I merged the changes and committed them back, but I can't get this message to go away.

menshikh-iv · 2018-04-30T08:20:38Z

@darindf please fix PEP8 https://travis-ci.org/RaRe-Technologies/gensim/jobs/370273539#L511; after that, the PR looks ready to merge.

menshikh-iv · 2018-05-01T05:42:04Z

Thanks @darindf, good work 👍

darindf force-pushed the develop branch from d80c03c to b1f004e Compare April 11, 2018 22:59

menshikh-iv suggested changes Apr 12, 2018

piskvorky requested changes Apr 12, 2018

menshikh-iv suggested changes Apr 21, 2018

Changed to datatype int for distributed

d76f1d6

darindf force-pushed the develop branch 6 times, most recently from fb9b1bb to a924c7a Compare April 23, 2018 19:41

Updated

f66130a

darindf force-pushed the develop branch from a924c7a to f66130a Compare April 23, 2018 19:44

Use dask instead of pyro

cb4900f

darindf force-pushed the develop branch from ee60e40 to cb4900f Compare April 23, 2018 20:12

Revert "Use dask instead of pyro"

a5e4bc4

This reverts commit cb4900f.

Removed space after paren

5a5b300

menshikh-iv changed the title ~~Changed from using floats to ints for doc terms & frequencies~~ Add dtype argument for chunkize_serial, fix LdaModel May 1, 2018

menshikh-iv merged commit 8b81091 into piskvorky:develop May 1, 2018
