
Optimize word mover distance (WMD) computation #3163

Merged · 3 commits · Jun 29, 2021

Conversation

@flowlight0 (Contributor) commented Jun 5, 2021

This change makes WMD computation faster by replacing a heavy nested loop for distance matrix construction with scipy's faster implementation.
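
For context, the core of the change boils down to the kind of replacement sketched below. This is only a minimal illustration, not the exact PR diff: the matrix shapes and variable names are made up, and the euclidean metric is just one example of what cdist can compute.

import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical word-vector matrices for two documents:
# v1 has shape (n1, dim), v2 has shape (n2, dim).
rng = np.random.default_rng(0)
v1 = rng.random((5, 50))
v2 = rng.random((7, 50))

# Before: the distance matrix was filled by a Python-level nested loop.
distance_matrix_loop = np.zeros((len(v1), len(v2)))
for i, vec1 in enumerate(v1):
    for j, vec2 in enumerate(v2):
        distance_matrix_loop[i, j] = np.sqrt(np.sum((vec1 - vec2) ** 2))

# After: the same matrix comes out of a single vectorized scipy call.
distance_matrix_fast = cdist(v1, v2, metric='euclidean')

assert np.allclose(distance_matrix_loop, distance_matrix_fast)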

I verified the performance improvement with the following micro-benchmark. It generates the two sets of text pairs below by sampling texts from 20newsgroups and measures the WMD computation speed of both the old and the new version.

  • Long: 500 pairs of full texts from 20newsgroups. The average number of tokens in this dataset is 256.69.
  • Short: 50,000 pairs of truncated texts. I truncated the texts from 20newsgroups so that their maximum number of tokens is less than or equal to 30.

Benchmark

import random
import time
from pathlib import Path
from typing import List, Tuple

import joblib
import numpy as np
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.datasets import fetch_20newsgroups
from tqdm import tqdm

from gensim.models import KeyedVectors

stopwords = stopwords.words("english")


def tokenize(model: KeyedVectors, text: str):
    return [token for token in word_tokenize(text.strip()) if token.lower() not in stopwords and token in model]


def average_text_length(text_pairs):
    # Average token count over both sides of every pair.
    lengths = []
    for (a, b) in text_pairs:
        lengths.append(len(a))
        lengths.append(len(b))
    return np.mean(lengths)


def load_text_pairs(model: KeyedVectors, num_pairs=500, num_texts=1000, max_length=None, seed=0):
    newsgroups_train = fetch_20newsgroups(subset='train')
    tokenized_texts = [tokenize(model=model, text=text) for text in newsgroups_train["data"][:num_texts]]
    if max_length is not None:
        tokenized_texts = [tokenized_text[:max_length] for tokenized_text in tokenized_texts]
    random.seed(seed)
    return [(random.choice(tokenized_texts), random.choice(tokenized_texts)) for _ in range(num_pairs)]


def run(model: KeyedVectors, text_pairs: List[Tuple[List[str], List[str]]]):
    start_time = time.time()
    values = []
    for (a, b) in tqdm(text_pairs, "Computing WMD", total=len(text_pairs)):
        values.append(model.wmdistance(a, b))
    print(f"Elapsed time: {time.time() - start_time:.4f} [s]")
    return values


def main():
    out_dir = Path("./output")
    out_dir.mkdir(exist_ok=True, parents=True)
    for d in [50, 100, 200]:
        print(f"The dimensionality of word vectors: {d}")
        model = KeyedVectors.load_word2vec_format(f"./glove.6B.{d}d.txt", no_header=True)
        long_pairs = load_text_pairs(model=model, num_pairs=500)
        print(f"The average number of tokens in long texts: {average_text_length(long_pairs):.4f}")
        values = run(model, long_pairs)
        joblib.dump(values, out_dir / f"long.{d}.bin")

        short_pairs = load_text_pairs(model=model, num_pairs=50000, max_length=25)
        print(f"The average number of tokens in short texts: {average_text_length(short_pairs):.4f}")
        values = run(model, short_pairs)
        joblib.dump(values, out_dir / f"short.{d}.bin")
        print("")


if __name__ == '__main__':
    main()

Result
The following two tables show how WMD computation performance changed. The computation is consistently faster: roughly a 7x speedup for short text pairs and a 2x speedup for long text pairs.

Before

vector size | time (short text) [s] | time (long text) [s]
50          | 287.76                | 179.40
100         | 290.24                | 180.13
200         | 321.66                | 187.87

After

vector size | time (short text) [s] | time (long text) [s]
50          | 41.89                 | 82.12
100         | 43.29                 | 87.39
200         | 46.95                 | 86.37

Of course, I checked that this change doesn't break the current behavior by comparing the outputs of the two versions:

In [10]: import joblib
In [11]: import numpy
In [12]: a = joblib.load("./new_output/long.50.bin")
In [13]: b = joblib.load("./old_output/long.50.bin")
In [14]: numpy.allclose(a, b)
Out[14]: True
  • tox -e flake8 succeeded
  • tox -e py36-linux succeeded

(Update 2021-06-07: fixed the explanation of the experimental results based on @gojomo's comment.)

@gojomo (Collaborator) commented Jun 6, 2021

Looks like a straightforward & valuable optimization to me! Thanks for the contribution!

(I think you meant to write, "7x speedup for short text pairs and 2x speedup for long text pairs".)

As an aside (not at all a blocker for this fix, just dumping some thoughts where other WMD users may find them): I suspect a bunch of other optimizations for common WMD usage patterns are possible. In particular, when doing pairwise WMDs for a batch of documents, preparing the word-to-word distance matrix once for the superset of all their words, provided that it doesn't grow too large, might provide a noticeable speedup. (If so, an API for this might accept a sequence of all desired comparisons, then automatically batch as many as fit within some manageable amount of memory.)
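
A rough sketch of what that batching could look like is below. None of this is an existing gensim API: the helper name is made up, duplicates within a document are not collapsed into the nBOW form the real solver expects, and the memory-budgeting part is left out.

import numpy as np
from scipy.spatial.distance import cdist

def pairwise_distance_submatrices(model, doc_pairs):
    # Build one word-to-word distance matrix for the superset of words used
    # by all documents, then reuse slices of it for every pair.
    vocab = sorted({token for a, b in doc_pairs for token in a + b})
    index = {token: i for i, token in enumerate(vocab)}
    vectors = np.vstack([model[token] for token in vocab])

    # One O(|V|^2 * dim) computation instead of one matrix per document pair.
    full_matrix = cdist(vectors, vectors, metric='euclidean')

    for a, b in doc_pairs:
        rows = [index[token] for token in a]
        cols = [index[token] for token in b]
        # The sub-matrix for this pair would be handed to the EMD solver
        # together with each document's normalized word weights.
        yield full_matrix[np.ix_(rows, cols)]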

I also suspect there are a bunch of small deviations from the exact WMD algorithm that speed it up a lot while sacrificing only a little precision, perhaps making it practical on much larger docs/sets-of-docs. For example: discarding large ranges of high-frequency or low-frequency words to work with more manageable vocabularies, or coalescing many of the words in large documents into a smaller number of pseudoword 'grains' before comparison. As an example of the pre-comparison compression that might be tested: "while the document is larger than 10 words, merge its two closest words into a single average word." Compare also SpaCy's trick of aliasing a larger number of word-keys to a smaller number of vectors. (IIUC, when a rarer word's vector is "very close" to a more-frequent word's vector, they may discard the rarer vector and just have that word-key point to the same surviving vector.)
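
To make the 'grains' idea concrete, here is a naive toy version of that merge loop. The helper is hypothetical and deliberately simple; a real implementation would also need to remember which original words were folded into each grain.

import numpy as np
from scipy.spatial.distance import cdist

def compress_to_grains(vectors, weights, max_words=10):
    # While the document has more than `max_words` (pseudo)words, merge the
    # two closest ones into their weighted average vector.
    vectors = np.asarray(vectors, dtype=float)
    weights = np.asarray(weights, dtype=float)
    while len(vectors) > max_words:
        dists = cdist(vectors, vectors)
        np.fill_diagonal(dists, np.inf)
        i, j = np.unravel_index(np.argmin(dists), dists.shape)
        merged_weight = weights[i] + weights[j]
        merged_vector = (weights[i] * vectors[i] + weights[j] * vectors[j]) / merged_weight
        keep = [k for k in range(len(vectors)) if k not in (i, j)]
        vectors = np.vstack([vectors[keep], merged_vector])
        weights = np.append(weights[keep], merged_weight)
    return vectors, weights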

@mpenkov changed the title from "Faster word mover distance (WMD) computation by removing a nested loop" to "Optimize word mover distance (WMD) computation" on Jun 7, 2021
@flowlight0 (Contributor, Author)
Thanks @gojomo for pointing out my mistake in the PR description and suggesting a couple of strategies for further optimization :)

@flowlight0 (Contributor, Author)
Is there any additional action required on my side for this change to be merged?

@piskvorky (Owner)
Thanks, I think we're good. Let's wait for @mpenkov 's review and merge.

@mpenkov (Collaborator) left a comment

Looks good to me. Thank you for your effort @flowlight0 !

@mpenkov merged commit b378b1b into piskvorky:develop on Jun 29, 2021
@mpenkov (Collaborator) commented Jun 29, 2021

Merged. Congrats on your first PR to gensim @flowlight0 ! 🥇

@flowlight0 (Contributor, Author)
Thanks to the gensim team for reviewing and merging this PR!
