Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] *2Vec SaveLoad improvements #2939

Merged
merged 23 commits into from
Sep 24, 2020
Merged

[MRG] *2Vec SaveLoad improvements #2939

merged 23 commits into from
Sep 24, 2020

Conversation

piskvorky
Copy link
Owner

@piskvorky piskvorky commented Sep 9, 2020

Continued from #2892 (subsumes #2892):

  1. Move model serialization fixups in save() / load() into _save_specials()/_load_specials(), to better handle cases when one SaveLoad contains another.
  2. Code style & py3 migration fixes.
  3. Resolved mysterious bloat in the pickle file of pre-4.0-dev Word2Vec models.

This PR does not include additional changes around serialization:

  • ensure there's at least one full test load of gensim-3.8.3 FastText/Word2Vec/Doc2Vec models that might not yet be tested
  • double-checking mmap options are functionally tested across 2Vec models, & fixing anything that might turn up
  • delete all tests of loading no-longer-supported older models – ideally, everything earlier than gensim-3.8.3 - and then potentially discarding support code thus no-longer needed

These will come in separate PRs later. In particular, we have to decide what to do with compatibility for models trained and stored pre-3.8.3.

@piskvorky piskvorky added this to the 4.0.0 milestone Sep 9, 2020
@piskvorky piskvorky requested a review from mpenkov September 9, 2020 13:59
@piskvorky
Copy link
Owner Author

@mpenkov already OKed by me & @gojomo ; feel free to merge after your review.

@piskvorky
Copy link
Owner Author

piskvorky commented Sep 13, 2020

resolve whatever mysterious bloat is in the pickle file of pre-4.0-dev Word2Vec models

@gojomo I checked the pickle sizes and they seem as expected. Do you have a reproducing example?

One potential pitfall is that SaveLoad stores the main pickle under protocol=2, whereas "native" pickle in Python 3 with protocol 3 upward (3, 4, or 5). The protocol=2 files are larger. Could that be the difference you saw?

gensim/models/word2vec.py Outdated Show resolved Hide resolved
gensim/models/word2vec.py Outdated Show resolved Hide resolved
@gojomo
Copy link
Collaborator

gojomo commented Sep 13, 2020

resolve whatever mysterious bloat is in the pickle file of pre-4.0-dev Word2Vec models

@gojomo I checked the pickle sizes and they seem as expected. Do you have a reproducing example?

One potential pitfall is that SaveLoad stores the main pickle under protocol=2, whereas "native" pickle in Python 3 with protocol 3 upward (3, 4, or 5). The protocol=2 files are larger. Could that be the difference you saw?

When I last ran the exact same Word2Vec job (on something like text8 or text9) in 3.8.3 vs develop, the gensim .save() file from develop was significantly larger, suggestive of something unnecessary being serialized. Do you not see that in your tests?

@piskvorky
Copy link
Owner Author

OK, I'll double check tmr.

@piskvorky
Copy link
Owner Author

piskvorky commented Sep 13, 2020

Yes, I can replicate on text8:

3.8.3

27034172 bytes in pickle protocol 2 (after training on a 66515395 raw words (49739954 effective words) took 96.0s, 518211 effective words/s).

4.0.0

68765086 bytes in pickle protocol 2 (after training on a 66515395 raw words (49737330 effective words) took 79.3s, 627379 effective words/s).

I'll investigate tomorrow.


The timings are for #2887; trained with word2vec.Word2Vec(word2vec.LineSentence('/tmp/enwik8'), size=100, window=5, min_count=2, workers=8).

@gojomo
Copy link
Collaborator

gojomo commented Sep 14, 2020

I'd especially suspect the backward-compatibility @property aliases index2word or index2entity, though even if pickle thinks those need serialization I'm not sure why they'd not use a proper ref.

I also just noticed that the KeyedVectors._load_specials() tries a pop of index2word twice where one attempt should be index2entity, though that couldn't be related to save-expansion. And, the back-compatibility setter for vectors_norm should probably trigger a loud error, like its getter (or the way getter & setter for .vocab errors). Perhaps index2word and index2entity should also loud-error to trigger old-code-updates as with .vocab and .vectors_norm.

@piskvorky
Copy link
Owner Author

piskvorky commented Sep 14, 2020

This is the core of it:

In [88]: type(model.wv.index_to_key[0])
Out[88]: numpy.str_

I'll check where "numpy strings" (fixed width) creep in.

@piskvorky
Copy link
Owner Author

piskvorky commented Sep 14, 2020

OK I think I found it, fixed. Have to run now, will clean up the pop & index2word etc later.

@piskvorky piskvorky changed the title [MRG] *2Vec SaveLoad improvements [WIP] *2Vec SaveLoad improvements Sep 14, 2020
@piskvorky piskvorky force-pushed the 2vec_saveload_fixes branch 2 times, most recently from 5509f9e to e5057c1 Compare September 20, 2020 11:29
@gojomo
Copy link
Collaborator

gojomo commented Sep 21, 2020

I'm still not sure about superficial breaking changes like size => vector_size or index2word => index_to_key. Do we really have to annoy users with that?

I'll add that to the list for "IRL discussion", probably best handled & decided live.

Previous discussion of vector_size, #2698 (comment) , where I thought this concern was explained and settled.

#1777 changed some, but not all, uses of size for vector-dimensionality to vector_size. Having both in use indefinitely & having code of different eras or closely-related classes prefer one or the other is worse than either one. I thought size was OK, but it does invite confusion with other things that might be a size, like total count of vectors, so I went with vector_size everywhere.

The blah2blah naming was always of questionable Pythonicity - no underscores? Abbreviating 'to' as '2'?

KeyedVectors is in its core intentionally not word-specific – both when used for purposes like Doc2Vec where the lookup keys aren't 'words', and even when specifically used by the Word2Vec algorithm because many times, the user's 'words' aren't truly words in the training domain. So for consistency with the class name, and to lessen the dissonance of always referring to non-word things as 'words', a more generic name was appropriate.

But #1777 & other contemporaneous work starting using 'entity' as the generic term in lots of places, over the objections of myself & other reviewers. ('Entity' is incredibly generic, and doesn't match the class name: it's KeyedVectors, highlighting the central fact that they are accessed by unique keys, not EntityVectors. Neither the key nor the vector is vividly described as an 'entity'.)

The incomplete & ill-considered shift toward enitity also meant the implementing classes had a mix of properties named either index2word or index2entity - and in that confusion, some classes made one the alias of the other, and in some cases a superclass used one as the actual property while the subclass used another, leading to duplication/confusion where some instances had both properties, with one an empty-list. It was very poor design, implementation, and review.

Using index_to_key and key_to_index is Pythonic, consistent across all users of the class, and consistent with the class-name KeyedVectors. Also, it emphasizes the essential inverse relationship between index_to_key and key_to_index (formerly the messy vocab dict): they are exactly each other's reverse-mappings, and always between keys and int indexes. (While not yet guaranteed, ideally the keys could be not just strings but any valid, hashable Python objects, such as tuples or ints.) It's confusion-resistant rather than confusion-generating.

@piskvorky
Copy link
Owner Author

piskvorky commented Sep 21, 2020

To be clear, I have no problem with vector_size and index_to_key.
The question is, does the naming purity outweigh the (massive) breaking of existing code and API that this will cause.

Checking "previous discussion" now. Edit: aha, you said we're already using both size and vector_size, inconsistently. So dropping one and keeping the other consistently makes sense. And you chose vector_size. OK.

Seems I'm consistent with my objections over time, at least.

@gojomo
Copy link
Collaborator

gojomo commented Sep 21, 2020

I'd also note that where it's practical & the underlying structure remains unchanged (an ordered list of lookup keys), the latest changes retain aliases for backward-compatible access:

https://github.com/RaRe-Technologies/gensim/blob/08a61e5eb6395ca4f03fa0bee9b2845f983fc3de/gensim/models/keyedvectors.py#L552-L566

Retaining size as an alias in __init__() methods would be possible but more onerous, as every method signature would then include redundant names for the same parameter, and every init method would include temporary boilerplate for using one if the other isn't present. "Ripping off the bandaid" now, at the same time as other changes, but still just the small step of a few mechanistic parameter replacements, is no worse, & probably better, than doing it later.

The biggest pain to any advanced API users is likely to be the removal of the .vocab dict, where each value was an object amenable to arbitrary expansion. The get_vecattr()/set_vecattr() mechanism offers that same potential, in a more compact & performant way, but is sufficiently different that learning/adopting the new system will require more attention to old code than just a search-and-replace names. I also expect a slightly larger 'bugtail' around that feature, as it's likely that out-in-wild patterns-of-use that we don't have covered in unit tests will exercise things in unexpected ways. But it's likely to be somewhat more capable programmers who are doing that, and the wins in generality/speed/memory-efficiency justify the one-time pain, in my opinion.

@piskvorky
Copy link
Owner Author

piskvorky commented Sep 21, 2020

the latest changes retain aliases for backward-compatible access:

I actually changed this in recent commits, following your:

Perhaps index2word and index2entity should also loud-error to trigger old-code-updates as with .vocab and .vectors_norm.

They hard-fail now. I was thinking whether to keep those aliases "alive" or not… but with so many breaking changes, I also figured it is best to "rip off the bandaid" rather than introduce new deprecations (again).

The biggest pain to any advanced API users is likely to be the removal of the .vocab dict,

Can you put the rationale for reworking .vocab into https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4 ? Even if we discussed it somewhere already, in some PR, it'll be super useful to have all the changes and their back-story in one place. That Wiki page will be my starting point for all release-related write-ups, migration guide, release notes, tweets. Thanks!

@piskvorky
Copy link
Owner Author

piskvorky commented Sep 21, 2020

the wins in generality/speed/memory-efficiency [of expandos] justify the one-time pain, in my opinion.

If expandos is backed by numpy arrays, won't it suffer from the same memory issues / bugs as numpy.str_ above? Numpy arrays are great for primitive types (ints, floats, etc), but not the best choice for arbitrary objects.

@gojomo
Copy link
Collaborator

gojomo commented Sep 22, 2020

If you've added hard-fails for those aliases, which I think is alright, that's in some still-pending PR.

I'll add an overview of the rationale on the wiki page.

I was unfamiliar with that limitation on string-typed numpy arrays! I'd seen object-typed numpy arrays, & had assumed Python strings were similarly flexibly handled, without fixed-length allocations. The type-imputation code should be changed to work around that, probably by using object type. The most prevalent extra attributes are simple numeric types (like ints or floats). In the other cases I can think of, they may also be ragged-sized other arrays of various types, as with HS-mode codes/points. I can't think of a current use where an actual string is needed, but the intent is that it should be supported – and I suspect both the ragged-sized arrays & strings will just best be served by using object type instead.

@piskvorky
Copy link
Owner Author

piskvorky commented Sep 22, 2020

Yes, object dtype should be safer wrt memory, that stores just the references. OTOH we'll lose the type safety of "all array values are of the same type"… but I guess nobody expects that from something called expandos anyway.

If you've added hard-fails for those aliases, which I think is alright, that's in some still-pending PR.

I added them to this PR:
https://github.com/RaRe-Technologies/gensim/pull/2939/files#diff-741f69634ff2e1b2c4ce544e0d82a1e6R616

@piskvorky
Copy link
Owner Author

@mpenkov ping – please review & merge, so we can move on.

@gojomo
Copy link
Collaborator

gojomo commented Sep 23, 2020

Happy to see the more-specific and -tangible add_vector/add_vectors renamings, & improved doc-comments!

My inline comments aren't things that should hold this up, just related concerns to discuss/address at some point.

Copy link
Collaborator

@mpenkov mpenkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed this together with @piskvorky earlier today.

In general, I think the changes are good to merge, but there are several things that could be better.

I think we should document expandos in code for the benefit of future developers and maintainers. So, add a multi-line code comment explaining what it is, what it does, and why it does things this way.

The same thing goes for specials.

It looks like the above information is available from comments in this PR (and others), but it's highly unlikely that anyone wanting to learn more about expandos and specials will know to look there.

@piskvorky
Copy link
Owner Author

piskvorky commented Sep 24, 2020

My notes from our call with @mpenkov ; FYI @gojomo :

Decisions

  1. Support loading of pre-3.8.3 models?
    • Misha: better: a stand-alone script to upgrade models to the latest version… rather than class methods doing the work on each load
    • Misha: also in favour of limited backward compatibility loading (just 3.8.3)
    • Radim: in favour of keeping existing compatibility code; why remove it? pros/cons unclear
      => decision: talk to Gordon, make the costs (pros/cons) more explicit.
  2. itertoolsvs it?
    • prefer full name unless critical to Gordon, but w/e.
  3. Remove subpar packages, contributed modules, wrappers?
    • yes; Misha to review & axe
    • what about NMF? semi-ready, but not much confidence, and its author no longer interested in finishing/maintaining.
      • options: 1. axe it; 2. invest effort & finish & fix; 3. do nothing, keep as-is => decision keep in 4.0.0, but Radim looks at option (2) later after 4.0.0

Actions

  • G: Finish the model serialization tickets: ?pre-3.8.3 models?, mmap, save_facebook bug
  • G: Finish loss tallying PR [early WIP] Fix/rationalize loss-tallying #2922? (only if feasible quickly)
  • R: Finish all docs:
    • migration guide (wiki)
    • release notes
    • docstrings
    • new website: integration & fix copy
    • migrate & re-run tutorials and guides
    • proof-read generated API docs for readability: we don't want stubs or obsolete/compatibility classes and methods like word2vec.Word2VecVocab, word2vec.Word2VecTrainables etc.
      • esp. with the new website style: correct syntax highlighting, headings, sections
    • document internal code style: hanging indents, trailing commas, FIXME vs TODO, etc
  • M: Finish structure cleanup: remove subpackages and modules we don't want, incl. their docs
  • M: Release 4.0.0

@mpenkov @gojomo did I forget something?

@piskvorky piskvorky merged commit c6c24ea into develop Sep 24, 2020
@piskvorky piskvorky deleted the 2vec_saveload_fixes branch September 24, 2020 10:45
@piskvorky
Copy link
Owner Author

piskvorky commented Sep 27, 2020

I think we should document expandos in code for the benefit of future developers and maintainers. So, add a multi-line code comment explaining what it is, what it does, and why it does things this way.

@mpenkov @gojomo I improved this in 782f7ff . I didn't want to start a new PR for that, so it's a part of #2954 , where I'm fixing docs anyway.

@gojomo
Copy link
Collaborator

gojomo commented Sep 29, 2020

It's a bit awkward to discuss release todos/criteria deep in a PR that's already been merged/closed. I'd suggest creating a specific 'rollup' issue that's for meta-discussion, and in/out decisions, around the "4.0.0" release.

My main hope would be to get more eyes on the big changes via one or more "not yet for everyone" test releases, as further described in new issue #2963.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants