New KeyedVectors.vectors_for_all method for vectorizing all words in a dictionary #3157

Merged · 39 commits merged into piskvorky:develop on Jun 29, 2021

Conversation

@Witiko (Contributor) commented on May 25, 2021

When searching for similar word embeddings with the KeyedVectors.most_similar() method, we often have a dictionary that restricts the set of words we would like to consider and, for subword models such as FastText that can infer word vectors, also expands it:

[screenshot illustrating how the dictionary restricts and expands the set of considered words]

This is also true for document similarity measures that use word embeddings as a source of word similarity, such as the Soft Cosine Measure. In the Soft Cosine Measure, the first step is the construction of a word similarity matrix. The word similarity matrix models a dictionary that will often differ from the vocabulary of the word embeddings. The word similarity matrix is sparse and uses the topn parameter of the KeyedVectors.most_similar() method to control how many closest words will be considered for each word. However, if the overlap between our dictionary and the vocabulary of the word embeddings is small, the KeyedVectors.most_similar() method will consistently return fewer than topn closest words from the dictionary and the matrix will be much sparser than it otherwise would have been. This leads to a lack of control and possibly weaker models.
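
For concreteness, here is a minimal sketch of that effect, using the toy corpus from gensim.test.utils; the three-word dictionary and the query word are illustrative choices, not taken from the PR:

from gensim.test.utils import common_texts
from gensim.corpora import Dictionary
from gensim.models import FastText

word_vectors = FastText(common_texts, vector_size=20, min_count=1).wv  # train word-vectors on the toy corpus
dictionary = Dictionary([['human', 'trees', 'graph']])  # a small dictionary with little overlap

topn = 5
neighbours = word_vectors.most_similar('human', topn=topn)
in_dictionary = [(word, sim) for word, sim in neighbours if word in dictionary.token2id]
print(len(in_dictionary), 'of', topn)  # fewer than topn neighbours come from the dictionary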

Proposed solution

The solution @gojomo and I discussed in #3146 (comment) is to have a KeyedVectors.vectors_for_all(words) method that would take an iterable of words (a dictionary) and produce a new KeyedVectors object containing vectors only for the requested words. In subword models such as FastText, vectors for words outside the vocabulary would be inferred. This would guarantee that all topn words retrieved by the KeyedVectors.most_similar() method originate from our dictionary.

Here is an example usage, which this PR also adds to the documentation:

from gensim.test.utils import common_texts, datapath
from gensim.corpora import Dictionary
from gensim.models import FastText
from gensim.models.word2vec import LineSentence

corpus = common_texts
model = FastText(corpus, vector_size=20, min_count=1)  # train word-vectors on a corpus
different_corpus = LineSentence(datapath('lee_background.cor'))
dictionary = Dictionary(different_corpus)
words = [word for word, count in dictionary.most_common()]  # construct vocabulary on a different corpus
word_vectors = model.wv.vectors_for_all(words)  # remove OOV word-vectors and infer vectors for new words
assert len(dictionary) == len(word_vectors)  # all words from our vocabulary received their word-vectors
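
As a follow-up to the example (not part of the PR's documentation), the guarantee can be checked directly: every neighbour retrieved from the new object comes from our dictionary.

some_word = words[0]  # any word from our dictionary
for neighbour, similarity in word_vectors.most_similar(some_word, topn=10):
    assert neighbour in dictionary.token2id  # all neighbours originate from the dictionary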

@Witiko changed the title from "Add a KeyedVectors method that vectorizes all words in a vocabulary" to "Add a KeyedVectors method that vectorizes all words in a dictionary" on May 25, 2021
@gojomo (Collaborator) commented on May 25, 2021

As before, I appreciate the general utility of this method.

Regarding name, my 1st impression is that fit() is too generic, but maybe could be convinced if there's little else that could be confused with it. (But: mightn't someone get the impression that fit() might do some fine-tuning?)

Perhaps vector_subset()?

I don't think always sorting is a good default, given that sets of word-vectors are most often pre-sorted in decreasing frequency, which proves useful when users want just a head-subset of results, and often gives a noticeable cache-warmness benefit in typical lookup patterns (with all the frequent words clustered near the front). The default could instead be never-sort (let the caller's chosen ordering specify exactly the order they want) or a non-default option-to-lexical-sort. (Similarly, I suspect a uniqueness check could be left to the caller, with whatever oddness is created by a nonunique list of keys being survivable.)
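
To illustrate the head-subset point, a one-line sketch (assuming a KeyedVectors object kv whose keys are already ordered by decreasing frequency, as gensim's training produces):

most_frequent_1000 = kv.index_to_key[:1000]  # a head-subset of the most frequent words is just a slice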

Also: given the vecattr mechanism for adding typed extra info per key into a KeyedVectors instance - most often, word frequencies - it may be interesting for such info to be preserved in this subset operation, either by default or (if the space/copy-over-time is too much to bear) only on request.
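
For context, a minimal sketch of the vecattr mechanism being referred to, with hypothetical keys and values:

import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors(vector_size=3)
kv.add_vectors(['apple', 'pear'], np.random.rand(2, 3).astype(np.float32))
kv.set_vecattr('apple', 'count', 42)     # attach typed extra info to a key
print(kv.get_vecattr('apple', 'count'))  # -> 42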

@Witiko (Contributor, Author) commented on May 26, 2021

@gojomo

As before, I appreciate the general utility of this method.

Thank you. In contrast to the original proposal, I moved the method from FastTextKeyedVectors to KeyedVectors, because I realized that removing unwanted words is just as beneficial and it makes sense to fit word2vec word vectors to a dictionary even though some words from the dictionary may not receive a word vector.

Regarding name, my 1st impression is that fit() is too generic, but maybe could be convinced if there's little else that could be confused with it. (But: mightn't someone get the impression that fit() might do some fine-tuning?)

Perhaps vector_subset()?

If we only took a subset of existing word vectors, then vector_subset() would be a perfect match. However, I think that fit() is better at communicating that we may also infer new word vectors. This inference requires little more than a single forward pass of the fastText model, so I don't think the name promises more than it delivers from the sklearn perspective. From the language perspective, the method shrinks and expands the KeyedVectors object so that it fits the dictionary.
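
For context, the single-pass inference amounts to composing a vector from the word's character n-grams; using the model from the documentation example above and a hypothetical out-of-vocabulary word:

oov_vector = model.wv['circumnavigational']  # not in the training data, yet it still receives a vector
assert 'circumnavigational' not in model.wv.key_to_index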

I don't think always sorting is a good default, given that sets of word-vectors are most-often sorted in decreasing frequency, which proves useful when users want just a head-subset of results, and often gives a noticeable cache-warmness benefit in typical lookup patterns

That is a good point. I was aiming at reproducibility, since we are enforcing uniqueness using set(), which randomizes the order. However, we can easily just filter out duplicates without reordering:

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys([3, 2, 2, 3, 2, 1]))
[3, 2, 1]
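
(On Python 3.7 and later, a plain dict.fromkeys() call preserves insertion order as well, so OrderedDict is not strictly needed here.)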

Similarly, I suspect a uniqueness check could be left to the caller, with whatever oddness is created by a nonunique list of keys being survivable.

Since we are already doing a single pass over the entire iterable and we want to know the size of the vocabulary before we start building the KeyedVectors object, it seems reasonable to remove the duplicates and check inferrability, so that we don't overallocate memory. Do you see any utility in not removing the duplicates?

Also: given the vecattr mechanism for adding typed extra info per key into a KeyedVectors instance - most often, word frequencies - it may be interesting for such info to be preserved in this subset operation, either by default or (if the space/copy-over-time is too much to bear) only on request.

At the moment, the lack of this functionality is explicitly noted in the docstring. Since we are not only subsetting word vectors, but also possibly inferring new ones, there is no guarantee that all word vectors in the resulting KeyedVectors object would have vector attributes even if all word vectors in the original KeyedVectors object had them, which seems confusing. Perhaps we should also add a parameter that would control whether new word vectors may be inferred. If new word vectors may not be inferred, then this gives us a stronger guarantee about the presence of vector attributes.

def fit(self, dictionary: Union[Iterable, Dictionary], allow_inference: bool = True, copy_vecattrs: bool = False):
    # ... build the new KeyedVectors object `kv`, iterating over each requested `key` ...
    if copy_vecattrs:
        for attr in self.expandos:
            try:
                val = self.get_vecattr(key, attr)
                kv.set_vecattr(key, attr, val)
            except KeyError:
                continue
    return kv
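
A hypothetical call matching the sketch above (the method was later renamed to vectors_for_all(), as discussed below):

word_vectors = model.wv.fit(dictionary, allow_inference=True, copy_vecattrs=True)  # hypothetical API from the sketch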

@Witiko (Contributor, Author) commented on May 26, 2021

If we only took a subset of existing word vectors, then vector_subset() would be a perfect match. However, I think that fit() is better at communicating that we may also infer new word vectors. This inference requires little more than a single forward pass of the fastText model, so I don't think the name promises more than it delivers from the sklearn perspective. From the language perspective, the method shrinks and expands the KeyedVectors object so that it fits the dictionary.

It is true that fit() strongly implies in-place modification of a mutable object, which is not what our method does.

Resolved review threads (outdated): gensim/models/keyedvectors.py (3 threads)
@Witiko marked this pull request as a draft on May 28, 2021 18:58
@Witiko force-pushed the feature/vectors-for-all branch from 686945a to 9ebe808 on May 28, 2021 20:21
@Witiko marked this pull request as ready for review on May 28, 2021 21:10
@Witiko (Contributor, Author) commented on May 28, 2021

@piskvorky, @gojomo Following the discussion, the vectors_for_all() method now:

  • Supports Dictionary objects, which it sorts by decreasing collection frequency for cache-warmness.
  • Supports copying vector attributes with copy_vecattrs=True.
  • Supports disabling the inference of new vectors with allow_inference=False (see the sketch after this list), so that we can have a guarantee that
    1. we will always receive a subset of the current KeyedVectors object and
    2. all vectors will have vector attributes defined when copy_vecattrs=True.
  • Is covered by five new unit tests.
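
A minimal sketch of how the two new parameters combine, reusing the trained FastText model and the word list from the documentation example earlier in this thread; the 'count' attribute is the word-frequency attribute that gensim normally records during training:

subset = model.wv.vectors_for_all(words, allow_inference=False, copy_vecattrs=True)

for word in subset.index_to_key:
    assert word in model.wv.key_to_index           # guarantee 1: a subset of the original vectors
    assert subset.get_vecattr(word, 'count') >= 1  # guarantee 2: vector attributes were copied over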

@Witiko (Contributor, Author) commented on May 28, 2021

Unless we want to rename the method¹, or remove the support for Dictionary objects², this PR should be ready for review.


¹ @gojomo suggested vector_subset() and I suggested fit(), materialize(), or vectorize(). However, vector_subset() and fit() are both misleading as discussed in #3157 (comment) and #3157 (comment). I don't prefer materialize() or vectorize() over vectors_for_all().

² @piskvorky suggested the removal in #3157 (comment). I argue that the use of Dictionary will be common and getting a properly sorted iterable out of Dictionary is tricky (many thanks to @gojomo for suggesting ordering by frequency).
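
For reference, the "properly sorted iterable" in question can be produced by hand as follows; this sketch assumes a gensim Dictionary named dictionary and is equivalent (up to ties) to taking the words from dictionary.most_common():

words_by_frequency = sorted(
    dictionary.token2id,
    key=lambda word: dictionary.cfs[dictionary.token2id[word]],
    reverse=True,
)  # decreasing collection frequency, for the cache-warmness discussed above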

@Witiko force-pushed the feature/vectors-for-all branch from a8f9550 to e5a9a31 on May 28, 2021 22:07
@Witiko changed the title from "Add a KeyedVectors method that vectorizes all words in a dictionary" to "Add a KeyedVectors method that vectorizes all words in a vocabulary" on May 28, 2021
@Witiko (Contributor, Author) commented on Jun 2, 2021

@gojomo @mpenkov Thank you again for the many thoughtful suggestions raised in the reviews. I will implement them in bulk once I have drafted the demo that I promised to @piskvorky in #3146 (comment) and which I am behind schedule on.

Resolved review threads (outdated): gensim/similarities/termsim.py, gensim/models/keyedvectors.py (3 threads), gensim/corpora/dictionary.py, gensim/test/test_corpora_dictionary.py
@Witiko (Contributor, Author) commented on Jun 22, 2021

I have implemented all suggestions from the reviews. Please let me know if there are any other changes to make before the merge.

Review comment context (test code):

vectors_for_all['an out-of-vocabulary word']
- vectors_for_all['responding']
)
self.assertGreater(greater_distance, smaller_distance)
Collaborator suggested change:
- self.assertGreater(greater_distance, smaller_distance)
+ assert greater_distance > smaller_distance


Review comment context (test code):

expected = self.test_model.wv['responding']
predicted = vectors_for_all['responding']
self.assertTrue(np.allclose(expected, predicted))
Collaborator suggested change:
- self.assertTrue(np.allclose(expected, predicted))
+ assert np.allclose(expected, predicted)


Review comment context (test code):

expected = self.vectors['conflict']
predicted = vectors_for_all['conflict']
self.assertTrue(np.allclose(expected, predicted))
Collaborator suggested change:
- self.assertTrue(np.allclose(expected, predicted))
+ assert np.allclose(expected, predicted)

@mpenkov changed the title from "Add a KeyedVectors method that vectorizes all words in a dictionary" to "New KeyedVectors.vectors_for_all method for vectorizing all words in a dictionary" on Jun 29, 2021
@mpenkov (Collaborator) left a review comment:

Looks good! Thank you @Witiko!

@mpenkov merged commit a93067d into piskvorky:develop on Jun 29, 2021