
Optimize word mover distance (WMD) computation #3163

Merged · 3 commits · Jun 29, 2021

Conversation

@flowlight0 (Contributor) commented Jun 5, 2021

This change makes WMD computation faster by replacing a heavy nested loop for distance matrix construction with scipy's faster implementation.
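
For context, the core of the change boils down to the kind of replacement sketched below. This is only a minimal illustration, not the exact PR diff: the matrix shapes and variable names are made up, and the euclidean metric is just one example of what cdist can compute.

import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical word-vector matrices for two documents:
# v1 has shape (n1, dim), v2 has shape (n2, dim).
rng = np.random.default_rng(0)
v1 = rng.random((5, 50))
v2 = rng.random((7, 50))

# Before: the distance matrix was filled by a Python-level nested loop.
distance_matrix_loop = np.zeros((len(v1), len(v2)))
for i, vec1 in enumerate(v1):
    for j, vec2 in enumerate(v2):
        distance_matrix_loop[i, j] = np.sqrt(np.sum((vec1 - vec2) ** 2))

# After: the same matrix comes out of a single vectorized scipy call.
distance_matrix_fast = cdist(v1, v2, metric='euclidean')

assert np.allclose(distance_matrix_loop, distance_matrix_fast)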

I verified the performance improvement with the following micro-benchmark. It generates the two sets of text pairs below by sampling texts from 20newsgroups and measures the WMD computation speed of both the old and the new version.

  • Long: 500 pairs of full texts from 20newsgroups. The average number of tokens in this dataset is 256.69.
  • Short: 50,000 pairs of truncated texts. I truncated the texts from 20newsgroups so that their maximum number of tokens is less than or equal to 30.

Benchmark

import random
import time
from pathlib import Path
from typing import List, Tuple

import joblib
import numpy as np
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.datasets import fetch_20newsgroups
from tqdm import tqdm

from gensim.models import KeyedVectors

stopwords = stopwords.words("english")


def tokenize(model: KeyedVectors, text: str):
    return [token for token in word_tokenize(text.strip()) if token.lower() not in stopwords and token in model]


def average_text_length(text_pairs):
    # Average token count over both sides of every pair.
    lengths = []
    for (a, b) in text_pairs:
        lengths.append(len(a))
        lengths.append(len(b))
    return np.mean(lengths)


def load_text_pairs(model: KeyedVectors, num_pairs=500, num_texts=1000, max_length=None, seed=0):
    newsgroups_train = fetch_20newsgroups(subset='train')
    tokenized_texts = [tokenize(model=model, text=text) for text in newsgroups_train["data"][:num_texts]]
    if max_length is not None:
        tokenized_texts = [tokenized_text[:max_length] for tokenized_text in tokenized_texts]
    random.seed(seed)
    return [(random.choice(tokenized_texts), random.choice(tokenized_texts)) for _ in range(num_pairs)]


def run(model: KeyedVectors, text_pairs: List[Tuple[List[str], List[str]]]):
    start_time = time.time()
    values = []
    for (a, b) in tqdm(text_pairs, "Computing WMD", total=len(text_pairs)):
        values.append(model.wmdistance(a, b))
    print(f"Elapsed time: {time.time() - start_time:.4f} [s]")
    return values


def main():
    out_dir = Path("./output")
    out_dir.mkdir(exist_ok=True, parents=True)
    for d in [50, 100, 200]:
        print(f"The dimensionality of word vectors: {d}")
        model = KeyedVectors.load_word2vec_format(f"./glove.6B.{d}d.txt", no_header=True)
        long_pairs = load_text_pairs(model=model, num_pairs=500)
        print(f"The average number of tokens in long texts: {average_text_length(long_pairs):.4f}")
        values = run(model, long_pairs)
        joblib.dump(values, out_dir / f"long.{d}.bin")

        short_pairs = load_text_pairs(model=model, num_pairs=50000, max_length=25)
        print(f"The average number of tokens in short texts: {average_text_length(short_pairs):.4f}")
        values = run(model, short_pairs)
        joblib.dump(values, out_dir / f"short.{d}.bin")
        print("")


if __name__ == '__main__':
    main()

Result
The following two tables show how WMD computation performance changed. The computation is consistently faster: roughly a 7x speedup for short text pairs and a 2x speedup for long text pairs.

Before

vector size | time (short text) [s] | time (long text) [s]
50          | 287.76                | 179.40
100         | 290.24                | 180.13
200         | 321.66                | 187.87

After

vector size | time (short text) [s] | time (long text) [s]
50          | 41.89                 | 82.12
100         | 43.29                 | 87.39
200         | 46.95                 | 86.37

Of course, I checked that this change doesn't break the current behavior by comparing the outputs of the two versions:

In [10]: import joblib
In [11]: import numpy
In [12]: a = joblib.load("./new_output/long.50.bin")
In [13]: b = joblib.load("./old_output/long.50.bin")
In [14]: numpy.allclose(a, b)
Out[14]: True
  • tox -e flake8 succeeded
  • tox -e py36-linux succeeded

(Update 2021-06-07: fixed the explanation of the experimental results based on @gojomo's comment.)

@gojomo (Collaborator) commented Jun 6, 2021

Looks like a straightforward & valuable optimization to me! Thanks for the contribution!

(I think you meant to write, "7x speedup for short text pairs and 2x speedup for long text pairs".)

As an aside (not at all a blocker for this fix, just dumping some thoughts where other WMD users may find them): I suspect a bunch of other optimizations for common WMD usage patterns are possible. In particular, when doing pairwise WMDs for a batch of documents, preparing the word-to-word distance matrix once for the superset of all their words, provided that it doesn't grow too large, might provide a noticeable speedup. (If so, an API for this might accept a sequence of all desired comparisons, then automatically batch as many as fit within some manageable amount of memory.)
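
A rough sketch of what that batching could look like is below. None of this is an existing gensim API: the helper name is made up, duplicates within a document are not collapsed into the nBOW form the real solver expects, and the memory-budgeting part is left out.

import numpy as np
from scipy.spatial.distance import cdist

def pairwise_distance_submatrices(model, doc_pairs):
    # Build one word-to-word distance matrix for the superset of words used
    # by all documents, then reuse slices of it for every pair.
    vocab = sorted({token for a, b in doc_pairs for token in a + b})
    index = {token: i for i, token in enumerate(vocab)}
    vectors = np.vstack([model[token] for token in vocab])

    # One O(|V|^2 * dim) computation instead of one matrix per document pair.
    full_matrix = cdist(vectors, vectors, metric='euclidean')

    for a, b in doc_pairs:
        rows = [index[token] for token in a]
        cols = [index[token] for token in b]
        # The sub-matrix for this pair would be handed to the EMD solver
        # together with each document's normalized word weights.
        yield full_matrix[np.ix_(rows, cols)]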

I also suspect there are a bunch of small deviations from the exact WMD algorithm that speed it up a lot while sacrificing only a little precision, perhaps making it practical on much larger docs/sets-of-docs. For example: discarding large ranges of high-frequency or low-frequency words to work with more manageable vocabularies, or coalescing many of the words in large documents into a smaller number of pseudoword 'grains' before comparison. As an example of the pre-comparison compression that might be tested: "while the document is larger than 10 words, merge its two closest words into a single average word." Compare also SpaCy's trick of aliasing a larger number of word-keys to a smaller number of vectors. (IIUC, when a rarer word's vector is "very close" to a more-frequent word's vector, they may discard the rarer vector and just have that word-key point to the same surviving vector.)
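
To make the 'grains' idea concrete, here is a naive toy version of that merge loop. The helper is hypothetical and deliberately simple; a real implementation would also need to remember which original words were folded into each grain.

import numpy as np
from scipy.spatial.distance import cdist

def compress_to_grains(vectors, weights, max_words=10):
    # While the document has more than `max_words` (pseudo)words, merge the
    # two closest ones into their weighted average vector.
    vectors = np.asarray(vectors, dtype=float)
    weights = np.asarray(weights, dtype=float)
    while len(vectors) > max_words:
        dists = cdist(vectors, vectors)
        np.fill_diagonal(dists, np.inf)
        i, j = np.unravel_index(np.argmin(dists), dists.shape)
        merged_weight = weights[i] + weights[j]
        merged_vector = (weights[i] * vectors[i] + weights[j] * vectors[j]) / merged_weight
        keep = [k for k in range(len(vectors)) if k not in (i, j)]
        vectors = np.vstack([vectors[keep], merged_vector])
        weights = np.append(weights[keep], merged_weight)
    return vectors, weights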

@mpenkov changed the title from "Faster word mover distance (WMD) computation by removing a nested loop" to "Optimize word mover distance (WMD) computation" on Jun 7, 2021
@flowlight0 (Contributor, Author)
Thanks @gojomo for pointing out my mistake in the PR description and suggesting a couple of strategies for further optimization :)

@flowlight0 (Contributor, Author)
Is there any additional action required on my side for this change to be merged?

@piskvorky (Owner)
Thanks, I think we're good. Let's wait for @mpenkov 's review and merge.

@mpenkov (Collaborator) left a comment

Looks good to me. Thank you for your effort @flowlight0 !

@mpenkov merged commit b378b1b into piskvorky:develop on Jun 29, 2021
@mpenkov (Collaborator) commented Jun 29, 2021

Merged. Congrats on your first PR to gensim @flowlight0 ! 🥇

@flowlight0 (Contributor, Author)
Thanks to the gensim team for reviewing and merging this PR!
