forked from piskvorky/gensim
Commit
Add `evaluate_word_analogies` (will replace `accuracy`) method for `gensim.models.KeyedVectors` (piskvorky#1935)

* Increased default restrict_vocab in accuracy

  The `accuracy` function evaluates the performance of word2vec models on the analogy task. The `restrict_vocab` parameter defines which part of the model vocabulary is used for evaluation. The previous default was the 30 000 most frequent words (analogy questions containing words beyond this threshold are simply skipped). It does make sense to use some kind of limit here, as the evaluation running time depends on the size of the vocabulary used. However, 30 000 is a very small value, with typical models nowadays featuring hundreds of thousands or even millions of words in their vocabularies. This leads to unrealistic evaluation scores, calculated only on small parts of the test set and the model.

  Therefore, I suggest increasing the default value of `restrict_vocab` 10-fold, to 300 000. This is more in line with the typical vocabulary size of contemporary word embedding models, and is also consistent with the default value for the `evaluate_word_pairs` function.

  Note that although the original C word2vec does mention 30 000 as a good threshold value for analogy evaluation, the default behavior of its `compute-accuracy` executable is still not to use any threshold (that is, to evaluate on the whole vocabulary).

* New word analogies method

  New method `evaluate_word_analogies` to solve word analogies. It implements a more sensible frequency threshold and the `dummy4unknown` parameter. It also runs twice as fast as the previous `accuracy` method, which is now deprecated.

* Mention new word analogies method in the doc
* Refer to new word analogies method in word2vec.py
* Removed redundant spaces
* Removed more redundant spaces
* Another round of space-elimination...
* Code polishing.
* Fix for docstring
* Hide log method, fix the docstring
* Docstring updated.
* Removed redundant spaces.
* cleanup docstrings
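To illustrate how the frequency threshold interacts with the score, here is a minimal pure-Python sketch of an analogy evaluation. It is not the gensim implementation: the function `evaluate_analogies`, its `ranked_vectors` argument, and the toy vectors are all hypothetical, and it assumes that `dummy4unknown=True` means a question containing an out-of-vocabulary word is counted as wrong instead of being skipped.

```python
import math


def _unit(vec):
    """Scale a vector to unit length so dot products rank like cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def evaluate_analogies(ranked_vectors, questions, restrict_vocab=300000,
                       dummy4unknown=False):
    """Hypothetical helper, not gensim's API.

    ranked_vectors: list of (word, vector) pairs, most frequent word first.
    questions: list of (a, b, c, expected) tuples, read as "a is to b as c is to ?".
    """
    # Keep only the `restrict_vocab` most frequent words.
    vocab = {word: _unit(vec) for word, vec in ranked_vectors[:restrict_vocab]}
    correct = incorrect = 0
    for a, b, c, expected in questions:
        if any(w not in vocab for w in (a, b, c, expected)):
            # Assumption: with dummy4unknown the question counts as a miss,
            # otherwise it is skipped and does not affect the score.
            if dummy4unknown:
                incorrect += 1
            continue
        # Classic vector-offset target: b - a + c.
        target = [vb - va + vc for va, vb, vc in zip(vocab[a], vocab[b], vocab[c])]
        candidates = (w for w in vocab if w not in (a, b, c))
        predicted = max(
            candidates,
            key=lambda w: sum(x * y for x, y in zip(vocab[w], target)),
            default=None,
        )
        if predicted == expected:
            correct += 1
        else:
            incorrect += 1
    total = correct + incorrect
    return correct / total if total else 0.0


# Toy data: dimensions are (royal, male, female, family); order is frequency rank.
ranked_vectors = [
    ("the", [1, 1, 1, 1]),
    ("man", [0, 1, 0, 0]),
    ("woman", [0, 0, 1, 0]),
    ("king", [1, 1, 0, 0]),
    ("queen", [1, 0, 1, 0]),
    ("uncle", [0, 1, 0, 1]),
    ("aunt", [0, 0, 1, 1]),
]
questions = [
    ("man", "woman", "king", "queen"),
    ("man", "woman", "uncle", "aunt"),
]

full = evaluate_analogies(ranked_vectors, questions)
skipped = evaluate_analogies(ranked_vectors, questions, restrict_vocab=5)
penalised = evaluate_analogies(ranked_vectors, questions, restrict_vocab=5,
                               dummy4unknown=True)
```

With the threshold set to 5, the second question falls outside the vocabulary. Skipping it leaves the score computed on a single question, while the `dummy4unknown`-style penalty halves it. This is the effect the commit message describes: a small threshold can yield scores based on only part of the test set.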
1 parent 5677ab3, commit 49e6abd
Showing 2 changed files with 142 additions and 8 deletions.