
[MRG] Fix similarity bug in NMSLIB indexer + documentation fixes #2899

Merged – 14 commits – Jul 30, 2020

Conversation

piskvorky (Owner) commented Jul 26, 2020

Supersedes #2821.

  • Fixed a bug where similarities.nmslib's most_similar() returned correct neighbours but with incorrect similarities.
  • Docstring clean up in the similarities.nmslib module. Follow up from Prepare gensim 3.8.3 #2820 (comment).
  • Added similarities.nmslib to our API documentation – it was missing before.
  • Renamed overly broad similarities.index to the more appropriate similarities.annoy.
  • Fixed deprecation warnings in Annoy integration: The default argument for metric will be removed in future version of Annoy. Please pass metric='angular' explicitly.
  • Went over our API reference page https://radimrehurek.com/gensim/apiref.html (visually, as users actually see it) and fixed a bunch of misformattings and errors. The summarizer subpackage in particular makes me sad every time I look at it.
  • Various fixes to code style, language, import order etc.
  • Got rid of six in places I touched (re. "removal of python2 constructs").
  • Changed version from 3.8.1 to 4.0.0.dev0; fixes Working from trunk branch 'develop', gensim.__version__ reports as 3.8.1 #2882.
  • Rebuilt Sphinx documentation

This PR also presented the opportunity to test migrating existing code to post #2698 code. Notes (CC @gojomo):

  1. Changed failing param iter => epochs.
  2. Changed failing param size => vector_size.
  3. Changed failing len(model.wv.vocab) => len(model.wv) in Annoy. (Renames 1–3 are sketched in code at the end of this comment.)
    • @gojomo is this valid even for fastText? What's the current contract on fastText's "size of vectors" – should len() return the total number of admissible keys (=infinity because fastText supports OOV), or count with ngrams, or just count of "full words"…?
  4. Failed with 'Word2Vec' object has no attribute 'init_sims'.
    • Fixed by re-introducing Word2Vec.init_sims(). That method still exists in doc2vec, fasttext, keyedvectors etc. – not sure why it's missing from word2vec? @gojomo is this an omission or done on purpose?
  5. Failed with AttributeError: module 'numpy.random' has no attribute 'default_rng'.
    • Fixed by installing a newer numpy, 1.14.2 => 1.19.1. Our setup.py says >=1.11.3 ought to be enough – CC @mpenkov is this expected? I thought we were testing against the oldest supported versions of our dependencies.
    • After upgrading numpy, got another error from scipy, No module named 'numpy.testing.decorators'. Fixed by upgrading scipy too, to 1.5.2.

We should cover (at least) those in our migration guide.

The list is likely not complete; it's only what I needed to successfully run our run_annoy.py tutorial & rebuild the sphinx docs.
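As a starting point for that guide, here is a minimal sketch of renames 1–3 above, assuming a toy corpus (parameter values are illustrative only):

    from gensim.models import Word2Vec

    sentences = [["hello", "world"], ["machine", "learning"]]

    # gensim 3.x:  model = Word2Vec(sentences, size=100, iter=5)
    # gensim 4.0:
    model = Word2Vec(sentences, vector_size=100, epochs=5, min_count=1)

    # gensim 3.x:  vocab_size = len(model.wv.vocab)
    # gensim 4.0:
    vocab_size = len(model.wv)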

piskvorky added this to the *2vec aftermath milestone – Jul 26, 2020
piskvorky changed the title from "Fix similarity bug in NMSLIB indexer + documentation fixes" to "[WIP] Fix similarity bug in NMSLIB indexer + documentation fixes" – Jul 26, 2020
piskvorky requested review from gojomo and mpenkov – July 26, 2020
piskvorky changed the title from "[WIP] Fix similarity bug in NMSLIB indexer + documentation fixes" to "[MRG] Fix similarity bug in NMSLIB indexer + documentation fixes" – Jul 26, 2020
     total_time = 0
     for _ in range(queries):
-        rand_vec = model.wv.vectors_norm[np.random.randint(0, len(model.wv.vocab))]
+        rand_vec = model.wv.vectors_norm[np.random.randint(0, len(model.wv))]
gojomo (Collaborator) commented:

Since vectors_norm is no longer a persistent cache of all unit-normed vectors, but rather (for backward-compatibility) re-calculated every time it's requested, this approach will now involve a lot of redundant recalculation of the whole array. Options: (a) grab it once before the loop; (b) adapt to use get_vector(random_index, use_norm=True).
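A minimal sketch of option (a), assuming the model and queries variables from the benchmark snippet above:

    import numpy as np

    # Hoist the normed matrix out of the loop: the (now uncached) vectors_norm
    # property is computed once here, instead of once per query.
    normed_vectors = model.wv.vectors_norm

    total_time = 0
    for _ in range(queries):
        rand_vec = normed_vectors[np.random.randint(0, len(model.wv))]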

piskvorky (Owner, Author) commented Jul 27, 2020:

Oh that's sneaky. Isn't it better to fail outright (breaking API), rather than do an expensive operation silently, where existing code expects a cheap dict lookup?

piskvorky (Owner, Author) commented Jul 27, 2020:

Or, if we want to keep backward compatibility, dynamically norm just the one requested vector… I haven't checked the code, but why would we recalc the whole array for a single __getitem__?

gojomo (Collaborator) commented:

We're not recalculating for any specific request of a single vector - that does the minimum, and would be a quickie option ((b) above) to improve this code.

But if a caller asks for vectors_norm, it's now calculated & returned (without being cached). Arguably, vectors_norm was barely even part of the public API, as it wasn't guaranteed to even exist if a caller hadn't triggered init_sims() yet. I wouldn't expect the problematic pattern here, requesting vectors_norm repeatedly, to be very common, especially when callers who just need single vectors have the get_vector(..., use_norm=True) option.

I suppose we could (a) cache it once requested, to be more like the old behavior, but then we have to add more logic for properly handling other invalidations of this cache elsewhere, and explicitly discarding it at times (like serialization); or (b) eliminate the backward-compatible @getter property, either leaving it blank (to generate a cryptic error), or stub in an explanatory error message. In that latter case, a new distinctly-named utility method could be added, calculate_unit_normed_vectors(), that avoids creating the impression this is a simple access of an already-existing property.

piskvorky (Owner, Author) commented Jul 27, 2020:

I'm in favour of removing vectors_norm = hard fail with a clear message (again, noob-friendly = prescription for "what to do next").

-1 on pretending we're a simple dict, when there's a substantial calculation happening behind the scenes.

Interesting that I didn't notice this during the #2698 review; I wonder how many other subtle points around deprecations / compatibility I missed. The migration guide + testing will be essential.

gojomo (Collaborator) commented:

Sure, .vectors_norm should be a hard-fail with helpful error message, similar to the one for init_sims(). I can add these for both as part of other touch-up work, such as that in #2892.
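A hypothetical sketch of such a hard-fail stub – names and wording illustrative, not gensim's actual code:

    class KeyedVectors:
        @property
        def vectors_norm(self):
            # Fail loudly, with a prescription for what to do next.
            raise AttributeError(
                "The vectors_norm attribute was removed in Gensim 4.0.0. "
                "Use get_vector(key, norm=True) to fetch a single unit-normed "
                "vector, or unit_normalize_all() to destructively norm all "
                "vectors in place."
            )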

piskvorky (Owner, Author) commented:

Alright. I'll add some tentative fix here (so I don't merge "known to be bad" code), merge this PR, and then let's continue with this in another PR, such as #2892.

piskvorky (Owner, Author) commented:

@gojomo pushed now. Let me know if OK and I'll merge this PR, to be continued elsewhere.

gojomo (Collaborator) commented Jul 27, 2020:

  3. Changed failing len(model.wv.vocab) => len(model.wv) in Annoy.

    • @gojomo is this valid even for fastText? What's the current contract on fastText's "size of vectors" – should len() return the total number of admissible keys (=infinity because fastText supports OOV), or count with ngrams, or just count of "full words"…?

The reported len() of a FastTextKeyedVectors will be that of its known vocabulary, which is inherited from the more generic KeyedVectors, and seems a reasonable value (compared to any alternative number): there's a plausible case for it, & it retains the ability to use FTKV objects as drop-ins wherever KV objects also work.

.vocab should not be used: it's obsolete & the only remaining accessor throws an error: https://github.com/RaRe-Technologies/gensim/blob/344c4abe26f96ef1a7299e82719b82c8a05c67a0/gensim/models/keyedvectors.py#L560-L566

(Simulating backward-compatible behavior would be very hard & inefficient, & after the bewildering mess of different things all called 'vocab' in the prior refactoring, I preferred as little reuse of that general term as possible.)

  4. Failed with 'Word2Vec' object has no attribute 'init_sims'.

    • Fixed by re-introducing Word2Vec.init_sims(). That method still exists in doc2vec, fasttext, keyedvectors etc – not sure why missing from word2vec? @gojomo is this an omission or done on purpose?

Per my source-line-targeted comments, I think init_sims() should only remain on KeyedVectors, as a (deprecated) hint to those adapting older code – the versions on Word2Vec, Doc2Vec, FastText should be gone; where they persist, it's just obsolete-code cleanup that hasn't happened yet.

  5. Failed with AttributeError: module 'numpy.random' has no attribute 'default_rng'.

    • Fixed by installing a newer numpy, 1.14.2 => 1.19.1. Our setup.py says >=1.11.3 ought to be enough – CC @mpenkov is this expected? I thought we were testing against the oldest supported versions of our dependencies.
    • After upgrading numpy, got another error from scipy, No module named 'numpy.testing.decorators'. Fixed by upgrading scipy too, to 1.5.2.

The numpy-performance-regression in the legacy PRNG led me to update code to use their new default_rng API, but I didn't update prerequisites to reflect this (and as you note, testing didn't show any errors). It looks like this default_rng API has been there since numpy 1.17.0 (released a year ago).

I personally feel it's OK (and even beneficial) for fresh gensim releases to require fairly-recent releases of those libraries that are more 'central'/large/many-contributor, like numpy, unless there's a really strong/specific reason to hold back. Lagging projects that may need older numpy/etc should also use older gensim! (And, if there were a strong economic reason some project needs a challenging configuration – such as bleeding-edge gensim but years-old numpy – the cost of creating/maintaining that idiosyncratic-configuration should fall to that project.)
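For reference, the legacy and new numpy PRNG APIs side by side; default_rng requires numpy >= 1.17.0:

    import numpy as np

    # Legacy API: available in old numpy, but with the performance regression
    # mentioned above.
    legacy_rng = np.random.RandomState(42)
    old_sample = legacy_rng.standard_normal(5)

    # New Generator API, which the updated gensim code uses; numpy >= 1.17 only.
    rng = np.random.default_rng(42)
    new_sample = rng.standard_normal(5)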

gojomo (Collaborator) commented Jul 27, 2020:

An aside on approximate-nearest-neighbor libraries: there's a recent new option from Google Research – ScaNN – which they show as far outperforming ANNOY & FB Faiss.

piskvorky (Owner, Author) commented Jul 27, 2020:

Yeah, thanks. I think I created a Gensim ticket for ScaNN when it came out, although I cannot seem to find it now.

piskvorky (Owner, Author) commented:

I personally feel it's OK (and even beneficial) for fresh gensim releases to require fairly-recent releases of those libraries

Yes, upgrading numpy is not a problem. I'd expect our tests to fail though. I'm pretty sure we were testing against the oldest lib, explicitly, but with all the recent changes to the testing framework (azure etc) it's possible that fell through the cracks. We should put it back if possible, @mpenkov WDYT?

piskvorky (Owner, Author) commented Jul 27, 2020:

One more idea: once we have the migration guide in place, use it on the other tutorials (beyond run_annoy.py), to validate the guide is complete.

The expectation is that all should run smoothly, no errors or deprecation warnings. A stress-test for our migration guide.

The added value is that we refresh the auto-generated notebooks & rst, because I expect many of the outputs & timings there will change with 4.0.0.

piskvorky force-pushed the pr2821 branch 2 times, most recently from 05282f5 to 0ee9008 – July 27, 2020
piskvorky (Owner, Author) commented Jul 28, 2020:

@mpenkov I'll go ahead and merge this once #2899 (comment) and #2899 (comment) are resolved, because I'm worried about getting out of sync again. But please still review when you have a moment.

        )

    def get_normed_vectors(self):
        # TODO: what's the way for users to get from a matrix index (integer) to the
piskvorky (Owner, Author) commented:

@gojomo FYI

gojomo (Collaborator) commented:

get_vector(key_or_int, use_norm=True) returns an individual normed vector, by key or raw integer slot. If for some reason a user wants a KeyedVectors that only has the normed vectors, the method unit_normalize_all() (with a suitable warning about the destructiveness of this operation) is available below. (That thus matches the previous init_sims(replace=True) behavior, for those that really need it, but I believe most users of that call were just trying to save memory in a way that's no longer necessary.)

We could add a convenience method for creating a separate new KV instance with unit-normed vectors, but as that goes above-and-beyond prior behavior, & I don't offhand know anyplace that's necessary, I'd rather not bloat this class with more "maybe someone somewhere needs it" functions. (I'd much rather move functions out, first, as per concerns described in #2873.) And if this were a real need, the right way to support it would likely be a generic deep-copy of the KV followed by a unit_normalize_all() on the new KV.

piskvorky (Owner, Author) commented Jul 30, 2020:

Yes. The idea here was to fuse unit_normalize_all() with get_normed_vectors(): have a single function which returns a new KV instance, normalized.

I think that's cleaner than several in-situ modifiers/getters – more functional, less state – but it's true the use-case is quite hypothetical.
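A hypothetical sketch of that fused design, written as a KeyedVectors method – illustrative only, not gensim's actual implementation:

    import copy

    import numpy as np

    def get_normed_vectors(self):
        """Return a new KeyedVectors with all vectors unit-normed; self is untouched."""
        new_kv = copy.deepcopy(self)
        norms = np.linalg.norm(new_kv.vectors, axis=1, keepdims=True)
        norms[norms == 0.0] = 1.0  # leave all-zero vectors unchanged
        new_kv.vectors = new_kv.vectors / norms
        return new_kv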

piskvorky (Owner, Author) commented:

get_vector(key_or_int, use_norm=True) returns an individual normed vector, by key or raw integer slot

Hmm, I don't see that in the code. How does the "raw integer slot" option work?

piskvorky (Owner, Author) commented:

I removed the int option from the docstring, because I don't think it exists.

@@ -1454,7 +1464,7 @@ def save_word2vec_format(self, fname, fvocab=None, binary=False, total_vec=None,
             if not (i == val):
                 break
             index_id_count += 1
-        keys_to_write = chain(range(0, index_id_count), store_order_vocab_keys)
+        keys_to_write = itertools.chain(range(0, index_id_count), store_order_vocab_keys)
gojomo (Collaborator) commented:

Elsewhere, I think I've followed a growing convention of treating itertools as a package with a well-known standard abbreviation, import itertools as it. I'd prefer it if gensim's project guide specifically recommended this for a small set of packages: np, pd, it, and maybe pkl, as it matches wider practices & simplifies import management. (That's especially true for numpy & itertools, where the prior practice of naming exactly which functions were used meant a jump-around typing burden of changing the imports every time some single function was used or removed-from-use.) But whatever project style can be unambiguously declared somewhere authoritative is OK.
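The aliasing convention spelled out; the pkl alias is gojomo's suggestion here, not an established standard:

    import itertools as it
    import pickle as pkl

    import numpy as np
    import pandas as pd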

piskvorky (Owner, Author) commented:

No preference from me. This one I changed merely because I saw two side-by-side imports import itertools; from itertools import chain, which looked weird, so I merged them into a single import itertools.

@@ -1264,14 +1263,14 @@ def save(self, *args, **kwargs):
         kwargs['ignore'] = kwargs.get('ignore', ignore_attrs)
         super(FastTextKeyedVectors, self).save(*args, **kwargs)

-    def get_vector(self, word, use_norm=False):
+    def get_vector(self, word, norm=False):
gojomo (Collaborator) commented:

use_norm was an unfortunate choice of parameter-name which I believe dated back to #1777, & I considered changing it, but didn't want to have to change all uses elsewhere & add another compatibility-burden on users.

But if you're entertaining breaking changes:

(1) not sure norm is a good name either - it makes it sound like you're requesting the vector's (scalar) norm/length;
(2) even supporting this via an optional extra parameter on a dual-use method strikes me as silly/opaque, too. Instead, have __getitem__ be the implementation of the default behavior, add a separate explicitly-named method for the unit-normed vector, like perhaps get_unit_vector(key_or_int), no optional parameters;
(3) more aggressively, maybe let users toggle the model to give either unit-vecs, or raw-vecs, as the default __getitem__ behavior, but have both get_unit_vector() and get_raw_vector() alt options for when a caller needs to specify. Then the desire you've expressed elsewhere to have unit-normed access, via plain dict-bracket-accesses, is available when toggled-on, but the raw data is never destroyed, preventing a class of errors/regrets. I'd put this thought aside as a somewhat complicated "maybe in the future" idea, but if you're OK to break get_vector() signature, & want to prioritize unit-normed access, maybe it's worth making that change now.
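A hypothetical sketch of option (2): __getitem__ keeps the raw-vector behavior, and the unit-normed vector gets its own explicitly-named method (illustrative class, not gensim's API):

    import numpy as np

    class KeyedVectorsSketch:
        def __init__(self, keys, vectors):
            self.key_to_index = {key: i for i, key in enumerate(keys)}
            self.vectors = np.asarray(vectors, dtype=np.float32)

        def __getitem__(self, key):
            # Default behavior: return the raw, un-normed vector.
            return self.vectors[self.key_to_index[key]]

        def get_unit_vector(self, key):
            # Explicitly-named accessor for the unit-normed vector; no boolean flag.
            vec = self[key]
            norm = np.linalg.norm(vec)
            return vec / norm if norm else vec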

piskvorky (Owner, Author) commented Jul 30, 2020:

TBH I find the idea of producing a new standalone (normed) KV instance cleaner than state toggles.

I'll merge here lest we accrue a monster PR again; let's revisit this in one of your other PRs.

piskvorky (Owner, Author) commented Jul 30, 2020:

@mpenkov some of the tests unrelated to this PR (I think) started failing: https://travis-ci.org/github/RaRe-Technologies/gensim/jobs/713448950#L862

I disabled them for now and opened a new ticket to look into it: #2908.

Also, how do you check the Circle CI results? Clicking on the "Details" link here brings up a page which asks me to sign in with GitHub, granting read & write access to my repos – which is not happening. How do you view the actual CI results?

mpenkov (Collaborator) commented Aug 7, 2020:

Yes.

I thought we were testing against the oldest supported versions of our dependencies.

No, that's not the case. When you pip install gensim in a clean environment, pip will pick the latest versions of the packages to install. All the CI builds occur in a clean environment, so in this case the pin was too loose.
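A hypothetical tightening of the pins in setup.py, reflecting the versions that fixed the errors above; the floors are illustrative and the actual 4.0.0 pins may differ:

    from setuptools import setup

    setup(
        name='gensim',
        version='4.0.0.dev0',
        install_requires=[
            'numpy >= 1.17.0',  # numpy.random.default_rng first appeared in 1.17.0
            'scipy >= 1.5.2',   # older scipy imports the removed numpy.testing.decorators
        ],
    )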

mpenkov (Collaborator) commented Aug 7, 2020:

Also, how do you check the Circle CI results? Clicking on the "Details" link here brings up a page which asks me to sign in with GitHub, granting read & write access to my repos – which is not happening. How do you view the actual CI results?

@piskvorky For me, I just click on "Details" and I can see the results for the relevant build.

piskvorky (Owner, Author) commented:

Weird. Somebody else will have to inspect the Circle CI results then; I cannot access them.
