Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

accuracy function for analogy test error #2455

Open
prokok opened this issue Apr 20, 2019 · 3 comments
Open

accuracy function for analogy test error #2455

prokok opened this issue Apr 20, 2019 · 3 comments

Comments

@prokok
Copy link

prokok commented Apr 20, 2019

Problem description

Analogy test for accuracy function does not work properly because of the typo (topn = False) on line 1187 in keyedvectors.py script.
(https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py)

Solution

There are 2 solutions regarding this matter: changing the typo in accuracy function or changing the most_smiliar function

  1. Change accuracy function in gensim/models/keyedvectors.py script on line 1187
    (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py)
sims = most_similar(self, positive=[b, c], negative=[a], topn=False, restrict_vocab=restrict_vocab)
==>
sims = most_similar(self, positive=[b, c], negative=[a], topn=None, restrict_vocab=restrict_vocab)
  1. Change the most_similar function in gensim/models/keyedvectors.py script on line 514,556
    (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py)
if topn is not None and topn < 1: ==> if not topn and topn<1:

AND

if topn is None: ==> if not topn:

Versions

Darwin-18.5.0-x86_64-i386-64bit
Python 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.16.2
SciPy 1.2.1
gensim 3.7.2
FAST_VERSION 0
@prokok prokok changed the title accuracy function for analogy test gives back empty lists as result accuracy function for analogy test gives back empty lists. Apr 20, 2019
@prokok prokok changed the title accuracy function for analogy test gives back empty lists. accuracy function for analogy test error Apr 20, 2019
@piskvorky
Copy link
Owner

piskvorky commented Apr 20, 2019

Thanks for reporting @vlfgns5.

The problem seems to have been introduced by #2356. @Witiko can you have a look, review the logic? Are there other places similarly affected by that change?

@Witiko
Copy link
Contributor

Witiko commented Apr 23, 2019

@piskvorky You are correct. Whereas previously topn was only cast to a boolean, #2356 gave topn=0 different semantics (return an empty iterable) than topn=None (return all similarities, undocumented behavior). The PR missed callers that would use topn=False, my apologies.

According to a quick grep, there are no callers other than WordEmbeddingsKeyedVectors.accuracy that use topn=False. As @vlfgns5 suggests, we can either fix the caller by changing topn=False to topn=None, or we can make most_similar accept both False and None. Since #2356 may have broken some user code that depended on topn=False, I think changing both the caller (to enforce topn=None as the canonical usage) and most_similar (to support legacy user code) is the best option.

Note that due to changes in Python 3, @vlfgns5's fix (if not topn and topn < 1:) raises a type error when topn is None, since NoneType and int are mutually unorderable. I suggest the following fix instead: if isinstance(topn, int) and topn < 1:. A less explicit condition would be ideal.

@Witiko
Copy link
Contributor

Witiko commented Apr 23, 2019

Digging deeper into this, properly supporting most_similar(…, topn=None) needs more work. When no indexer is provided to most_similar, then topn=None (and topn=False) return the expected result. However, when an indexer (such as Annoyindexer) is provided, then the call fails, because Annoy only supports integer topn. Therefore, we need an additional fix that disables the indexer when topn is None:

< if indexer is not None:
> if indexer is not None and isinstance(topn, int):
      return indexer.most_similar(mean, topn)

This is a separate issue that has not yet been reported, but it should perhaps be fixed together with this, see PR #2461.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants