
Performance question on sorting on Dictionary.doc2bow and MmWriter.write_vector #2043

Closed
darindf opened this issue May 9, 2018 · 7 comments


@darindf
Contributor

darindf commented May 9, 2018

I have a question about performance. I have several large documents, each on the order of several megabytes, and while profiling I noticed some hot spots.

I see that Dictionary.doc2bow sorts its result, i.e. the list of (token id, frequency) tuples, before returning it:

# return tokenids, in ascending id order
result = sorted(iteritems(result))

What is more interesting is that this behavior is not specified in the documentation:

Convert `document` into the bag-of-words (BoW) format = list of (token_id, token_count)

Then, when saving the document to a Matrix Market file using MmWriter.write_vector, the token ids are sorted again:

vector = sorted((i, w) for i, w in vector if abs(w) > 1e-12)  # ignore near-zero entries

Is there a specific reason why the token ids are sorted? I can see the value from a debugging perspective, but not from a performance one, and the sorting could always be done by the user after calling doc2bow, or prior to calling write_vector.
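For illustration, a quick toy check of the behavior (the ids follow from gensim assigning new ids to a document's tokens in sorted token order, so the exact ids may vary):

from gensim.corpora import Dictionary

dct = Dictionary(["human computer interaction".split()])
# doc2bow returns (token_id, count) pairs in ascending id order,
# regardless of the token order in the input document
print(dct.doc2bow("interaction computer interaction human".split()))
# [(0, 1), (1, 1), (2, 2)]  with computer=0, human=1, interaction=2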

I have another side question on MmWriter: why does it write to files in binary mode, when the content is text, i.e. the header line and integer/float values? I'm concerned about the overhead that

self.fout.write(utils.to_utf8("%i %i %s\n" % (docno + 1, termid + 1, weight)))

adds in formatting the string and converting it to bytes, given that docno and termid are integers and weight is either an integer or a float.
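For a rough sense of that cost, here's a standalone micro-benchmark sketch of just the formatting and encoding steps (for unicode input, utils.to_utf8 boils down to .encode('utf8')):

import timeit

docno, termid, weight = 41, 1336, 0.25

# string formatting only
t_fmt = timeit.timeit(lambda: "%i %i %s\n" % (docno + 1, termid + 1, weight), number=10**6)
# formatting plus utf-8 encoding
t_enc = timeit.timeit(lambda: ("%i %i %s\n" % (docno + 1, termid + 1, weight)).encode("utf-8"), number=10**6)

print("format: %.3fs, format+encode: %.3fs" % (t_fmt, t_enc))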

@piskvorky
Owner

piskvorky commented May 10, 2018

The fact that sparse vectors are sorted by id is an API requirement. It cannot be relegated to user-land.

Both list sorting and converting text to utf8 are extremely fast; are these really the bottleneck? Can you share your profiling numbers?

@darindf
Contributor Author

darindf commented May 10, 2018

It seems that the sorted ordering is not well documented. Which API requires the ordering? Perhaps I'm not using an API that has this requirement.

I'm looking at the LdaModel inference function, and the token ids are used for indexing into a numpy array, so their order is insignificant:

        # Now, for each document d update that document's gamma and phi
        # Inference code copied from Hoffman's `onlineldavb.py` (esp. the
        # Lee&Seung trick which speeds things up by an order of magnitude, compared
        # to Blei's original LDA-C code, cool!).
        for d, doc in enumerate(chunk):
            if len(doc) > 0 and not isinstance(doc[0][0], six.integer_types + (np.integer,)):
                # make sure the term IDs are ints, otherwise np will get upset
                ids = [int(idx) for idx, _ in doc]
            else:
                ids = [idx for idx, _ in doc]
            cts = np.array([cnt for _, cnt in doc], dtype=self.dtype)
            gammad = gamma[d, :]
            Elogthetad = Elogtheta[d, :]
            expElogthetad = expElogtheta[d, :]
            expElogbetad = self.expElogbeta[:, ids]
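A toy check of that point: numpy fancy indexing accepts the ids in any order, so nothing in this snippet depends on them being sorted.

import numpy as np

expElogbeta = np.arange(12.0).reshape(3, 4)  # stand-in for self.expElogbeta
sorted_ids = [0, 2, 3]
unsorted_ids = [3, 0, 2]

# both selections succeed; only the column order of the result differs,
# and it stays consistent with `cts` built from the same doc
print(expElogbeta[:, sorted_ids])
print(expElogbeta[:, unsorted_ids])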

@darindf
Contributor Author

darindf commented May 10, 2018

Here's the timing for write_vector, slightly refactored to measure the individual expressions. A significant amount of time is spent sorting the list; note that the non-zero filter could be moved inside the loop. The vector was generated by doc2bow, so it is already sorted, though not filtered. But the weights are all 1 or higher, so nothing would be filtered anyway.

The other costly operations are the string operations: formatting the string and then converting it to bytes. These two operations take 50% of the time. I/O, usually the slowest operation, is only 18%.

Total time: 214.43 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    29                                               @profile
    30                                               def write_vector(self, docno, vector):
    31                                                   """Write a single sparse vector to the file.
    32
    33                                                   Parameters
    34                                                   ----------
    35                                                   docno : int
    36                                                       Number of document.
    37                                                   vector : list of (int, number)
    38                                                       Document in BoW format.
    39
    40                                                   Returns
    41                                                   -------
    42                                                   (int, int)
    43                                                       Max word index in vector and len of vector. If vector is empty, return (-1, 0).
    44
    45                                                   """
    46    200000     705635.0      3.5      0.1          assert self.headers_written, "must write Matrix Market file headers before writing data!"
    47    200000     461901.0      2.3      0.1          assert self.last_docno < docno, "documents %i and %i not in sequential order!" % (
    48                                                       self.last_docno, docno)
    49    200000  106853804.0    534.3     16.0          vector = sorted((i, w) for i, w in vector if abs(w)
    50                                                                   > 1e-12)  # ignore near-zero entries
    51  55735579   87383778.0      1.6     13.1          for termid, weight in vector:  # write term ids in sorted order
    52                                                       # +1 because MM format starts counting from 1
    53  55535579  147871611.0      2.7     22.1              string = "%i %i %s\n" % (docno + 1, termid + 1, weight)
    54  55535579  201079838.0      3.6     30.1              utf8 = utils.to_utf8(string)
    55  55535579  122919925.0      2.2     18.4              self.fout.write(utf8)
    56    200000     443498.0      2.2      0.1          self.last_docno = docno
    57    200000     699743.0      3.5      0.1          return (vector[-1][0], len(vector)) if vector else (-1, 0)

@darindf
Contributor Author

darindf commented May 10, 2018

Here are the results of profiling doc2bow. The two sorts account for 14% of the time.

Notes: allow_update is True, return_missing is False.

Total time: 326.349 s
Function: doc2bow at line 390

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   390                                               @profile
   391                                               def doc2bow(self, document, allow_update=False, return_missing=False):
   392                                                   """Convert `document` into the bag-of-words (BoW) format = list of (token_id, token_count).
   393
   394                                                   Parameters
   395                                                   ----------
   396                                                   document :  list of str
   397                                                       Input document.
   398                                                   allow_update : bool, optional
   399                                                       If True - update dictionary in the process (i.e. add new tokens and update frequencies).
   400                                                   return_missing : bool, optional
   401                                                       Also return missing tokens (that doesn't contains in current dictionary).
   402
   403                                                   Return
   404                                                   ------
   405                                                   list of (int, int)
   406                                                       BoW representation of `document`
   407                                                   list of (int, int), dict of (str, int)
   408                                                       If `return_missing` is True, return BoW representation of `document` + dictionary with missing
   409                                                       tokens and their frequencies.
   410
   411                                                   Examples
   412                                                   --------
   413                                                   >>> from gensim.corpora import Dictionary
   414                                                   >>> dct = Dictionary(["máma mele maso".split(), "ema má máma".split()])
   415                                                   >>> dct.doc2bow(["this","is","máma"])
   416                                                   [(2, 1)]
   417                                                   >>> dct.doc2bow(["this","is","máma"], return_missing=True)
   418                                                   ([(2, 1)], {u'this': 1, u'is': 1})
   419
   420                                                   """
   421    200000    1155314.0      5.8      0.1          if isinstance(document, string_types):
   422                                                       raise TypeError(
   423                                                           "doc2bow expects an array of unicode tokens on input, not a single string")
   424
   425                                                   # Construct (word, frequency) mapping.
   426    200000    1193112.0      6.0      0.1          counter = defaultdict(int)
   427 130932509  211663990.0      1.6     20.8          for w in document:
   428 130732509  324987575.0      2.5     31.9              counter[w if isinstance(w, unicode) else unicode(w, 'utf-8')] += 1
   429
   430    200000     684697.0      3.4      0.1          token2id = self.dictionary.token2id
   431    200000     358058.0      1.8      0.0          if allow_update or return_missing:
   432    200000     650063.0      3.3      0.1              missing = sorted(x for x in iteritems(
   433    200000   84813070.0    424.1      8.3                  counter) if x[0] not in token2id)
   434    200000     454015.0      2.3      0.0              if allow_update:
   435    495888    1051980.0      2.1      0.1                  for w, _ in missing:
   436                                                               # new id = number of ids made so far;
   437                                                               # NOTE this assumes there are no gaps in the id sequence!
   438    295888     806651.0      2.7      0.1                      token2id[w] = len(token2id)
   439
   440    200000     460326.0      2.3      0.0          result = {token2id[w]: freq for w,
   441    200000   77163622.0    385.8      7.6                    freq in iteritems(counter) if w in token2id}
   442
   443    200000     442439.0      2.2      0.0          if allow_update:
   444    200000     726779.0      3.6      0.1              self.dictionary.num_docs += 1
   445    200000    2942745.0     14.7      0.3              self.dictionary.num_pos += sum(itervalues(counter))
   446    200000     673287.0      3.4      0.1              self.dictionary.num_nnz += len(result)
   447                                                       # increase document count for each unique token that appeared in the document
   448    200000     414727.0      2.1      0.0              dfs = self.dictionary.dfs
   449  55735579   95271296.0      1.7      9.4              for tokenid in iterkeys(result):
   450  55535579  153648424.0      2.8     15.1                  dfs[tokenid] = dfs.get(tokenid, 0) + 1
   451
   452                                                   # return tokenids, in ascending id order
   453    200000   56915359.0    284.6      5.6          result = sorted(iteritems(result))
   454    200000     481080.0      2.4      0.0          if return_missing:
   455                                                       return result, dict(missing)
   456                                                   else:
   457    200000     333753.0      1.7      0.0              return result

@piskvorky
Owner

Thanks for investigating, @darindf. What version of Gensim is this? There were some I/O optimizations recently in #1825.

I'm afraid not much can be done about the sorting. Bypassing such basic data sanity normalization, in favour of a corner case optimization, would lead to trouble elsewhere. Perhaps you can remove the sorting locally?

@menshikh-iv
Contributor

@piskvorky the optimization from #1825 is only for the reader (not the writer), i.e. #1825 doesn't affect the current case.

@darindf I'm +1 for #2043 (comment): this is important "sanity normalization". If we drop it, I'm 100% sure that some non-trivial cases would break after this change.

@darindf
Contributor Author

darindf commented May 11, 2018

Thanks for the feedback. This version of gensim has #1825, which applies to the MmReader.

Benchmarking the before/after cases, I get approximately a 20% time reduction processing 200k documents; this part of the program reads documents from the database, creates the BoW (and updates the dictionary), and writes the BoW vector to the Matrix Market file.
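For context, a minimal sketch of that pipeline stage, with a hypothetical stand-in for the database read:

from gensim.corpora import Dictionary, MmCorpus

def stream_tokenized_docs():
    """Hypothetical stand-in for reading and tokenizing documents from the database."""
    for raw in ("human computer interaction", "graph of trees"):
        yield raw.split()

dictionary = Dictionary()
# doc2bow with allow_update=True builds the dictionary while producing BoW vectors
bow_stream = (dictionary.doc2bow(tokens, allow_update=True) for tokens in stream_tokenized_docs())
MmCorpus.serialize("corpus.mm", bow_stream)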
