-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Cython version of MmReader
#1825
Conversation
This is great! Speeding up the "basic blocks" (MmCorpus included) is a much needed direction. @arlenk Are there any cons to this change? What are the disadvantages, trade-offs? @menshikh-iv shall we keep the old "pure-Python" version, or do we drop the slow versions and keep only the compiled code path? I'm +1 for dropping, now that we offer a good system for binary (pre-compiled) wheels to Gensim users. |
Here are the cons I can think of:
|
@piskvorky I'm +1 for dropping too (because we have wheels right now), this has no sense to support 2 versions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work 👍
setup.py
Outdated
@@ -256,9 +256,12 @@ def finalize_options(self): | |||
Extension('gensim.models.doc2vec_inner', | |||
sources=['./gensim/models/doc2vec_inner.c'], | |||
include_dirs=[model_dir]), | |||
Extension('gensim.corpora._mmreader', | |||
sources=['./gensim/corpora/_mmreader.c'], | |||
include_dirs=[model_dir]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why model_dir
here? This isn't related to gensim/models
.
gensim/corpora/_mmreader.pyx
Outdated
|
||
def __init__(self, input, transposed=True): | ||
""" | ||
Initialize the matrix reader. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use numpy-style docstrings (here and everywhere).
logger = logging.getLogger(__name__) | ||
|
||
|
||
cdef class MmReader(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is almost same as a pure-python version (sscanf
different), maybe something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, nothing else really. It's a very trivial change. Just a change to use sscanf instead of split() and int() and a few type definitions (eg, docid, termid, and previd in iter).
No doubt converting the actual file access to c would further improve performance, but that will require dealing with all the different file types that smart_open accepts. Also, I don't think this is the main hotspot in iter.
For example, here's a line profiler output of the current (develop branch) code when reading a raw text file:
Line # % Time Line Contents ============================== 907 0.0 with utils.file_or_filename(self.input) as lines: 908 0.0 self.skip_headers(lines) 911 8.6 for line in lines: 912 40.4 docid, termid, val = utils.to_unicode(line).split() # needed for python3 913 7.0 if not self.transposed: 914 termid, docid = docid, termid 915 # -1 because matrix market indexes are 1-based => convert to 0-based 916 22.5 docid, termid, val = int(docid) - 1, int(termid) - 1, float(val) 917 6.7 assert previd <= docid, "matrix columns must come in ascending order" 918 6.4 if docid != previd: 931 932 8.4 document.append((termid, val,)) # add another field to the current document 933
The majority of time is spent splitting the line and converting the resulting strings to ints (lines 912 and 916). sscanf handles both of these lines. Another 20% or so is taken up by the testing self.transposed and comparing docid and previd (lines 913, 917, and 918). Statically typing these variables in cython helps quite a bt too.
For a compressed text file, line 911 (for line in lines) takes up 33% of the time, but most of the time is still taken by line.split() and int().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the line profile output when reading a compressed file
Line # % Time Line Contents ============================== 907 0.0 with utils.file_or_filename(self.input) as lines: 908 0.0 self.skip_headers(lines) 911 31.5 for line in lines: 912 30.6 docid, termid, val = utils.to_unicode(line).split() # needed for python3 913 5.1 if not self.transposed: 914 termid, docid = docid, termid 915 # -1 because matrix market indexes are 1-based => convert to 0-based 916 17.0 docid, termid, val = int(docid) - 1, int(termid) - 1, float(val) 917 4.9 assert previd <= docid, "matrix columns must come in ascending order" 918 4.6 if docid != previd: 932 6.3 document.append((termid, val,)) # add another field to the current document
@arlenk please update |
@menshikh-iv, i updated my develop branch If you want to drop the python only version, let me know, and I can update the PR to only include the cython version (perhaps in a gensim.corpora.mmreader module). |
@arlenk great, yes, please remove pure python version, and this LGTM, wdyt @piskvorky? |
What are the compatibility implications of this? Any changes needed in existing user code? Does it support the same range of inputs (compressed etc)? |
I don't think there are any compatibility implications. It should be a drop in replacement for the current mmreader.
Yes, the actual opening of the file and looping through the rows is still handled by python/smart_open. The only part that is in c is "splitting" each line and converting the values to ints/floats. I added a few more unittests for MmCorpus as well, including one with a compressed file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now, problem is
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "gensim/__init__.py", line 6, in <module>
from gensim import parsing, matutils, interfaces, corpora, models, similarities, summarization, utils # noqa:F401
File "gensim/corpora/__init__.py", line 8, in <module>
from .mmcorpus import MmCorpus # noqa:F401
File "gensim/corpora/mmcorpus.py", line 17, in <module>
from gensim.corpora.mmreader import MmReader
ImportError: No module named mmreader
I'm not sure, but maybe "incule_dir" solve this problem.
setup.py
Outdated
@@ -256,9 +256,11 @@ def finalize_options(self): | |||
Extension('gensim.models.doc2vec_inner', | |||
sources=['./gensim/models/doc2vec_inner.c'], | |||
include_dirs=[model_dir]), | |||
Extension('gensim.corpora.mmreader', | |||
sources=['./gensim/corpora/mmreader.c']), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably, "include_dirs" should be gensim/corpora
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @menshikh-iv . It looks like it is failing on the bold line below in the tox config:
python -c "from gensim.models.word2vec import FAST_VERSION; print(FAST_VERSION)" python setup.py build_ext --inplace python -c "from gensim.models.word2vec import FAST_VERSION; print(FAST_VERSION)"
This tries to import word2vec before any of the cython modules are compiled. word2vec imports gensim.corpor, which import mmreader. Since I took out the python version of mmreader, this line is failing as the cython version isn't built yet (this happens on the next line of the tox config).
For now, I removed the bold line to confirm the build works. If the plan is to only include the c/cython versions of code, I am not sure if the bold line above is still necessary. If it is still necessary, then I can add a try/except to gensim.corpora.init, though that would only be used for this line in the tox file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arlenk aha, this means that now we can't use FAST_VERSION
freely, this isn't good.
Probably, we need to move this constant (because this needed for checking that all compiled successfully), to a zone where we didn't receive an exception if something goes wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I added back the FAST_VERSION check to the tox config, so the config is the same as before.
To make the test run, I had to add a try/except block to gensim.corpora.init as that file tries to import MmCorpus. Since this check happens before the c-extension is built, MmCorpus fails because there is no pure python MmReader any more. The except block in the init file just creates a dummy MmCorpus class so the code "works" (ie, the imports don't fail, but obviously the dummy classes will not work if used).
This is not a very elegant solution, but I can't think of a better option if that check in the tox file is necessary. Maybe there is a different way to check if the compilation step failed? Also, is the compilation step still necessary with a wheel distribution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is not a good solution.
Compilation step is a mandatory step for wheels, I agree, but we also have repo & developers who work with it.
As alternative, I propose to replace import in tox.ini
to something like this from gensim.models.word2vec_inner import FAST_VERSION
, i.e. direct import, this will be helpful for this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As alternative, I propose to replace import in tox.ini to something like this from gensim.models.word2vec_inner import FAST_VERSION, i.e. direct import, this will be helpful for this case?
That's a good idea, but unfortunately that fails too. Importing anything under gensim.* implicitly runs gensim/init.py, which imports gensim.corpora, which imports mmreader (which doesn't exist yet at this point in the tox process).
Do you know what this line in the tox config is meant to test?
python -c "from gensim.models.word2vec import FAST_VERSION; print(FAST_VERSION)" python setup.py build_ext --inplace python -c "from gensim.models.word2vec import FAST_VERSION; print(FAST_VERSION)" pytest {posargs:gensim/test}
It seems like it tests whether all the "pure python" code installed correctly, but I wonder if there is a different way to test that. Or is that test necessary since either the following line (the setup.py line) or the pytest line will fail if the install was not good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know what this line in the tox config is meant to test?
I added it when I write the first version of tox.ini
for sanity checking only
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to remove this (and also remove try/except block in python file)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know what this line in the tox config is meant to test?
I added it when I write the first version of tox.ini for sanity checking only
ah, ok, that makes sense. I will remove this line for now and drop the try/except.
thanks
Hello @arlenk, sorry for waiting, can you make last changes
and PR will be ready to merge. |
gensim/corpora/ucicorpus.py
Outdated
@@ -15,7 +15,7 @@ | |||
from gensim import utils | |||
from gensim.corpora import Dictionary | |||
from gensim.corpora import IndexedCorpus | |||
from gensim.matutils import MmReader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move MmReader
to matutils
(yes, this is wired, but needed for backward compatibility). We'll move it to corpora
with major release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok done. I kept the cython version as "gensim.corpora._mmreader". Let me know if you want this renamed to gensim.corpora.mmreader or something else.
@@ -256,9 +256,11 @@ def finalize_options(self): | |||
Extension('gensim.models.doc2vec_inner', | |||
sources=['./gensim/models/doc2vec_inner.c'], | |||
include_dirs=[model_dir]), | |||
Extension('gensim.corpora._mmreader', | |||
sources=['./gensim/corpora/_mmreader.c']), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
include_dirs
no needed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, I don't think it's needed. The other pyx/c files include model_dir because they reference each other, but this module doesn't need to reference any of the other cython files.
gensim/corpora/mmcorpus.py
Outdated
# try to load fast, cythonized code if possible | ||
from gensim.corpora._mmreader import MmReader | ||
FAST_VERSION = 1 | ||
logger.info('Fast version of MmReader is being used') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to log anything here (and after, because this will be shown for each import of gensim at all (because we import all stuff in __init__.py
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, will remove
gensim/corpora/mmcorpus.py
Outdated
try: | ||
# try to load fast, cythonized code if possible | ||
from gensim.corpora._mmreader import MmReader | ||
FAST_VERSION = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need FAST_VERSION
here at all (this is connected with BLAS stuff, that isn't used here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good will remove this too
gensim/corpora/mmcorpus.py
Outdated
@@ -19,21 +19,78 @@ | |||
logger = logging.getLogger('gensim.corpora.mmcorpus') | |||
|
|||
|
|||
class MmCorpus(matutils.MmReader, IndexedCorpus): | |||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move this try/except section to matutils.py
and restore all imports (this is simplest and clearest way)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried moving the import to matutils, but it causes some issues with circular imports.
For example, if I put the following in matutils.py:
try:
# try loading cython version of mmreader
from gensim.corpora._mmreader import MmReader
except ImportError:
# else fallback to python version
class MmReader(object):
pass
When the try block tries to import gensim.corpora._mmreader, that executes gensim.corpora.__init__, which tries to import gensim.corpora.mmcorpus.MmCorpus, which unfortunately tries to import matutils.MmReader, which isn't defined yet (as we are still in the try block).
You end up with a exception like:
----> 1 import gensim
~/Projects/gensim/gensim-mmreader/gensim/__init__.py in <module>()
4 """
5
----> 6 from gensim import parsing, matutils, interfaces, corpora, models, similarities, summarization, utils # noqa:F401
7 import logging
8
~/Projects/gensim/gensim-mmreader/gensim/matutils.py in <module>()
33 try:
34 # try loading cython version of mmreader
---> 35 from gensim.corpora._mmreader import MmReader
~/Projects/gensim/gensim-mmreader/gensim/corpora/__init__.py in <module>()
6 from .indexedcorpus import IndexedCorpus # noqa:F401 must appear before the other classes
7
----> 8 from .mmcorpus import MmCorpus # noqa:F401
9 from .bleicorpus import BleiCorpus # noqa:F401
~/Projects/gensim/gensim-mmreader/gensim/corpora/mmcorpus.py in <module>()
21
---> 22 class MmCorpus(matutils.MmReader, IndexedCorpus):
23 """
AttributeError: module 'gensim.matutils' has no attribute 'MmReader'
I couldn't think of an easy way around this, short of moving _mmreader out of the corpora package, but I'll keep playing around with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, that's really sad, I understand what happens :( To avoid this, we need to move _mmreader
from corpora (but I have no idea what's better place for it).
pass | ||
|
||
def test_load(self): | ||
self.assertRaises(ValueError, lambda: [doc for doc in self.corpus]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't catch, what is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you asking specifically about TestMmCorpusCorrupt? That's a test to confirm the code will raise an exception if the input file is malformed. The new code checks that sscanf reads in 3 values, so I just test that and raise an exception if fewer values are read.
As for the other tests, I added tests for:
- the two compressed formats (gzip and bz2)
- to make sure you can access docs by index for a corpus with an index
- to check the values read back match what we expect (mostly, I was worried about floats getting truncated to ints)
@@ -215,10 +215,10 @@ def _get_slice(corpus, slice_): | |||
self.assertRaises(RuntimeError, _get_slice, corpus_, 1.0) | |||
|
|||
|
|||
class TestMmCorpus(CorpusTestCase): | |||
class TestMmCorpusWithIndex(CorpusTestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's about backward compatibility tests? Or we have no problems here because we change the only reader?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I confirmed that all the tests pass with the existing version of MmReader as well, so there shouldn't be any backward compatibility issues.
Mmreader
Mmreader
MmReader
@CLearERR one more task for you (documenting), add this to "docsim PR", you need to go through files & cleanup docstrings after I merge this PR:
Be accurate:
|
@arlenk great work! I fixed import-error, will merge this after CI if all fine, let's return to LDA cythonization PR. |
* simple cython version of mmreader * confirm MmCorpus read correct values from corpus file * remove unnecessary include dirs for _mmreader.c * add numpy-style docstrings * update conversion specifier for doubles for sscanf * drop plain python version of mmreader * additional unittests for mmcorpus * add gzipped mmcorpus unittest * add bzip compressed test corpus * remove tox test for non-compiled word2vec * add back tox test for non-compiled word2vec * add dummy classes for tox's non-compiled word2vec check * fix ordering of imports for gensim.corpora * remove tox test for non-compiled code * update MANIFEST to include cython mmreader * add back python only version of MmReader * mend * mend * move python version of MmReader back to matutils * fix import problems * fix import order
@arlenk great work, thank you very much 🔥! |
@arlenk how did you create the text8 |
Sure, here are the files (I had to gzip the index file as well so that github would allow me to attach it). Here's the code I used for creating the files from the original text8 corpus import gensim.downloader as api
import gensim.corpora as gc
import numpy as np
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
np.seterr(all='raise')
# load data
_corpus_path = '~/corpora/'
def save_corpus(name='text8'):
texts = api.load(name)
print("creating dictionary")
dictionary = gc.Dictionary([[ word.lower() for word in text ] for text in texts ])
outfile = os.path.join(_corpus_path, "{}.dict".format(name))
print("saving dictionary to: {}".format(outfile))
dictionary.save(outfile)
print("converting to corpus")
corpus = [ dictionary.doc2bow([ word.lower() for word in text]) for text in texts ]
outfile = os.path.join(_corpus_path, "{}.mm".format(name))
print("saving corpus to {}".format(outfile))
gc.MmCorpus.serialize(outfile, corpus) |
and here's the code for the tfidf version. I created this version just to make sure there were no issues with floating point numbers in the input file. import gensim.models as gm
import gensim.corpora as gc
corpus = gc.MmCorpus('~/corpora/text8.mm',)
tfidf = gm.TfidfModel(corpus)
corpus_tfidf = [ tfidf[doc] for doc in corpus ]
gc.MmCorpus.serialize('~/corpora/text8.tfidf.mm', corpus_tfidf) |
Thanks a lot @arlenk ! |
Very naive version of mmreader in cython.
From some profiling, when simply looping through an uncompressed mm corpus, 60% of the time is spent splitting each line and converting the strings to ints (or floats). For a compressed corpus, about 45% of the time is spent in splitting and converting to ints/floats.
This version simply performs those two steps in cython using sscanf. The rest of the code is pretty much unchanged -- though still compiled in cython for simplicity. This change has a decent impact on performance, especially for uncompressed files. The table below shows some profiling results for looping through a few different corpuses in mm format (the tfidf corpus is used to make sure the code works with an mm file that has floats as well).
In theory, the rest of the code could be converted into c code as well, though that would require writing c code to handle all the different file types that the mmreader accepts (eg, compressed files) and from profiling results, that portion of the code only accounts for 10% (uncompressed files) to 30% (compressed files) of the time.
The real world impact of this change is much more modest than the results above (unless for some reason you are just looping through a corpus...), but I have seen time savings of 15% in some real world uses of LDA with this change. Obviously the impact any user sees will depend on the size of their corpus and the type of model they are using.