Performance question on sorting on Dictionary.doc2bow and MmWriter.write_vector #2043
Comments
The fact that sparse vectors are sorted by id is an API requirement. It cannot be relegated to user-land. Both list sorting and converting text to utf8 are extremely fast; are these really the bottleneck? Can you share your profiling numbers?
It seems that the sorted ordering is not well documented. Which API requires ordering? Perhaps I'm not using an API that has this requirement. I'm looking at the LdaModel inference function, where the token ids are used for indexing into a numpy array, so order is insignificant.
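For illustration, here is a minimal sketch (not gensim's actual inference code) of the indexing pattern described above: the token ids of a bag-of-words vector select columns of a topic-word matrix, so the order of the (id, count) pairs does not change the result. All names, shapes, and values below are made up for the example.

```python
import numpy as np

# Hypothetical stand-in for a model's topic-word weights (not gensim's internals).
num_topics, vocab_size = 10, 1000
exp_elog_beta = np.random.rand(num_topics, vocab_size)

bow_sorted = [(3, 2), (17, 1), (42, 5)]
bow_shuffled = [(42, 5), (3, 2), (17, 1)]  # same vector, different order

def weighted_topic_sums(bow):
    ids = [tid for tid, _ in bow]
    counts = np.array([cnt for _, cnt in bow], dtype=float)
    # Token ids are used as column indices; the same columns are selected
    # regardless of the order of the pairs.
    return (counts * exp_elog_beta[:, ids]).sum(axis=1)

assert np.allclose(weighted_topic_sums(bow_sorted), weighted_topic_sums(bow_shuffled))
```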
Here's the timing for write_vector, slightly refactored to evaluate the different expressions separately. A significant amount of time is spent sorting the list; note that the non-zero filter could be moved inside the loop. The vector was generated by doc2bow, so it is already sorted, although not filtered, but since the weights are all 1 or higher, nothing will be filtered anyway. The other costly operations are the string operations: formatting the line and then converting it to bytes. These two operations take 50% of the time. I/O, which is usually the slowest operation, is only 18%.
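A rough micro-benchmark along these lines (my own sketch, not the original measurement) could look like the following; the "%i %i %s" line format mirrors what write_vector appears to emit, and the vector size and repeat counts are arbitrary.

```python
import timeit

# An already-sorted BoW vector, as doc2bow would produce it.
vector = [(termid, 1.0) for termid in range(10_000)]

# Time the individual operations discussed above: sorting, line formatting,
# and the utf8 encoding done before writing in binary mode.
print("sort   :", timeit.timeit(lambda: sorted(vector), number=100))
print("format :", timeit.timeit(
    lambda: ["%i %i %s" % (1, termid + 1, weight) for termid, weight in vector], number=100))
print("encode :", timeit.timeit(
    lambda: [("%i %i %s\n" % (1, termid + 1, weight)).encode("utf8") for termid, weight in vector], number=100))
```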
Here are the results of profiling doc2bow. The two sorts account for 14% of the time. Notes: allow_update is True, return_missing is False.
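To reproduce a comparable profile (again a hedged sketch with synthetic data, not the original benchmark), something like this should work:

```python
import cProfile
from gensim.corpora import Dictionary

# Synthetic large documents; sizes are arbitrary stand-ins for the
# multi-megabyte documents mentioned in the report.
docs = [["token%d" % (i % 20_000) for i in range(100_000)] for _ in range(5)]

dictionary = Dictionary()

def build_bows():
    # Same flags as in the report: allow_update=True, return_missing=False.
    return [dictionary.doc2bow(doc, allow_update=True, return_missing=False) for doc in docs]

profiler = cProfile.Profile()
profiler.runcall(build_bows)
profiler.print_stats(sort="tottime")
```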
Thanks for investigating @darindf. What version of Gensim is this? There were some I/O optimizations recently in #1825. I'm afraid not much can be done about the sorting. Bypassing such basic data sanity normalization, in favour of a corner case optimization, would lead to trouble elsewhere. Perhaps you can remove the sorting locally?
@piskvorky the optimization from #1825 is only for the reader (not for the writer), i.e. #1825 doesn't affect the current case. @darindf I'm +1 for #2043 (comment): this is important "sanity normalization"; if we drop it, I'm 100% sure that some non-trivial cases would break after this change.
Thanks for the feedback. This version of gensim already has #1825, which is for the MmReader. Benchmarking the before/after case, I get approximately a 20% time reduction when processing 200k documents; this part of the program reads documents from the database, creates the BoW (and updates the dictionary), and writes the BoW vector to the Matrix Market file.
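For context, a minimal version of that pipeline might look like the sketch below; the document source is a stand-in for the database read, and MmCorpus.serialize drives MmWriter.write_vector under the hood.

```python
from gensim.corpora import Dictionary, MmCorpus

def iter_documents():
    # Stand-in for reading and tokenizing documents from the database.
    yield ["some", "tokenized", "document"]
    yield ["another", "tokenized", "document", "document"]

dictionary = Dictionary()

def bow_stream():
    for tokens in iter_documents():
        # Build the BoW vector and update the dictionary in the same pass.
        yield dictionary.doc2bow(tokens, allow_update=True)

# Serialize the streamed BoW vectors to a Matrix Market file.
MmCorpus.serialize("/tmp/corpus.mm", bow_stream())
```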
I have a question on performance. I have several large documents, on the order of several megabytes, and was performing some profiling and noticed some hot spots.
I see that Dictionary.doc2bow, when returning its result (a list of (token id, frequency) tuples), sorts this list.
What is more interesting is that this behavior is not described in the documentation.
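A quick way to observe the behavior (a small self-contained example; the actual ids assigned will depend on the dictionary contents):

```python
from gensim.corpora import Dictionary

dictionary = Dictionary()
bow = dictionary.doc2bow("the quick brown fox jumps over the lazy dog the".split(),
                         allow_update=True)

print(bow)
# The (token_id, count) pairs come back ordered by ascending token id.
assert bow == sorted(bow)
```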
Then, when saving the document into the Matrix Market file using MmWriter.write_vector, the token ids are sorted again.
Is there a specific reason why the token ids are sorted? I can see the value from a debugging perspective, but not from a performance one, and the sorting could always be done by the user after calling doc2bow or before calling write_vector.
I have another side question on MmWriter: why does it write to files in binary mode, when the content is either text (the header line) or integer/float values? I'm concerned with the overhead this adds in string formatting and in converting the result to bytes, given that docno and termid are integers and weight can be either an integer or a float.
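To gauge that overhead, one could compare text-mode writes against the binary-mode write-plus-encode pattern (a hypothetical micro-benchmark; paths, sizes, and the line format are assumptions for illustration):

```python
import timeit

rows = [(1, termid + 1, 1.0) for termid in range(100_000)]

def text_mode(path="/tmp/mm_text.txt"):
    # Text mode: Python encodes the string as part of the write.
    with open(path, "w") as fout:
        for docno, termid, weight in rows:
            fout.write("%i %i %s\n" % (docno, termid, weight))

def binary_mode(path="/tmp/mm_binary.txt"):
    # Binary mode: encode to utf8 explicitly before writing, as MmWriter appears to do.
    with open(path, "wb") as fout:
        for docno, termid, weight in rows:
            fout.write(("%i %i %s\n" % (docno, termid, weight)).encode("utf8"))

print("text mode  :", timeit.timeit(text_mode, number=10))
print("binary mode:", timeit.timeit(binary_mode, number=10))
```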