[MRG] *2Vec SaveLoad improvements #2939

piskvorky · 2020-09-09T13:12:46Z

Continued from #2892 (subsumes #2892):

Move model serialization fixups in save() / load() into _save_specials()/_load_specials(), to better handle cases when one SaveLoad contains another.
Code style & py3 migration fixes.
Resolved mysterious bloat in the pickle file of pre-4.0-dev Word2Vec models.

This PR does not include additional changes around serialization:

ensure there's at least one full test load of gensim-3.8.3 FastText/Word2Vec/Doc2Vec models that might not yet be tested
double-checking mmap options are functionally tested across 2Vec models, & fixing anything that might turn up
delete all tests of loading no-longer-supported older models – ideally, everything earlier than gensim-3.8.3 - and then potentially discarding support code thus no-longer needed

These will come in separate PRs later. In particular, we have to decide what to do with compatibility for models trained and stored pre-3.8.3.

piskvorky · 2020-09-09T14:00:03Z

@mpenkov already OKed by me & @gojomo ; feel free to merge after your review.

piskvorky · 2020-09-13T13:58:25Z

resolve whatever mysterious bloat is in the pickle file of pre-4.0-dev Word2Vec models

@gojomo I checked the pickle sizes and they seem as expected. Do you have a reproducing example?

One potential pitfall is that SaveLoad stores the main pickle under protocol=2, whereas "native" pickle in Python 3 with protocol 3 upward (3, 4, or 5). The protocol=2 files are larger. Could that be the difference you saw?

gensim/models/keyedvectors.py

gensim/models/word2vec.py

gojomo · 2020-09-13T19:44:02Z

resolve whatever mysterious bloat is in the pickle file of pre-4.0-dev Word2Vec models

@gojomo I checked the pickle sizes and they seem as expected. Do you have a reproducing example?

One potential pitfall is that SaveLoad stores the main pickle under protocol=2, whereas "native" pickle in Python 3 with protocol 3 upward (3, 4, or 5). The protocol=2 files are larger. Could that be the difference you saw?

When I last ran the exact same Word2Vec job (on something like text8 or text9) in 3.8.3 vs develop, the gensim .save() file from develop was significantly larger, suggestive of something unnecessary being serialized. Do you not see that in your tests?

piskvorky · 2020-09-13T19:59:03Z

OK, I'll double check tmr.

piskvorky · 2020-09-13T21:45:29Z

Yes, I can replicate on text8:

3.8.3

27034172 bytes in pickle protocol 2 (after training on a 66515395 raw words (49739954 effective words) took 96.0s, 518211 effective words/s).

4.0.0

68765086 bytes in pickle protocol 2 (after training on a 66515395 raw words (49737330 effective words) took 79.3s, 627379 effective words/s).

I'll investigate tomorrow.

The timings are for #2887; trained with word2vec.Word2Vec(word2vec.LineSentence('/tmp/enwik8'), size=100, window=5, min_count=2, workers=8).

gojomo · 2020-09-14T05:36:30Z

I'd especially suspect the backward-compatibility @property aliases index2word or index2entity, though even if pickle thinks those need serialization I'm not sure why they'd not use a proper ref.

I also just noticed that the KeyedVectors._load_specials() tries a pop of index2word twice where one attempt should be index2entity, though that couldn't be related to save-expansion. And, the back-compatibility setter for vectors_norm should probably trigger a loud error, like its getter (or the way getter & setter for .vocab errors). Perhaps index2word and index2entity should also loud-error to trigger old-code-updates as with .vocab and .vectors_norm.

piskvorky · 2020-09-14T08:26:55Z

This is the core of it:

In [88]: type(model.wv.index_to_key[0])
Out[88]: numpy.str_

I'll check where "numpy strings" (fixed width) creep in.

piskvorky · 2020-09-14T08:38:22Z

OK I think I found it, fixed. Have to run now, will clean up the pop & index2word etc later.

gojomo · 2020-09-21T02:35:34Z

I'm still not sure about superficial breaking changes like size => vector_size or index2word => index_to_key. Do we really have to annoy users with that?

I'll add that to the list for "IRL discussion", probably best handled & decided live.

Previous discussion of vector_size, #2698 (comment) , where I thought this concern was explained and settled.

#1777 changed some, but not all, uses of size for vector-dimensionality to vector_size. Having both in use indefinitely & having code of different eras or closely-related classes prefer one or the other is worse than either one. I thought size was OK, but it does invite confusion with other things that might be a size, like total count of vectors, so I went with vector_size everywhere.

The blah2blah naming was always of questionable Pythonicity - no underscores? Abbreviating 'to' as '2'?

KeyedVectors is in its core intentionally not word-specific – both when used for purposes like Doc2Vec where the lookup keys aren't 'words', and even when specifically used by the Word2Vec algorithm because many times, the user's 'words' aren't truly words in the training domain. So for consistency with the class name, and to lessen the dissonance of always referring to non-word things as 'words', a more generic name was appropriate.

But #1777 & other contemporaneous work starting using 'entity' as the generic term in lots of places, over the objections of myself & other reviewers. ('Entity' is incredibly generic, and doesn't match the class name: it's KeyedVectors, highlighting the central fact that they are accessed by unique keys, not EntityVectors. Neither the key nor the vector is vividly described as an 'entity'.)

The incomplete & ill-considered shift toward enitity also meant the implementing classes had a mix of properties named either index2word or index2entity - and in that confusion, some classes made one the alias of the other, and in some cases a superclass used one as the actual property while the subclass used another, leading to duplication/confusion where some instances had both properties, with one an empty-list. It was very poor design, implementation, and review.

Using index_to_key and key_to_index is Pythonic, consistent across all users of the class, and consistent with the class-name KeyedVectors. Also, it emphasizes the essential inverse relationship between index_to_key and key_to_index (formerly the messy vocab dict): they are exactly each other's reverse-mappings, and always between keys and int indexes. (While not yet guaranteed, ideally the keys could be not just strings but any valid, hashable Python objects, such as tuples or ints.) It's confusion-resistant rather than confusion-generating.

piskvorky · 2020-09-21T08:01:57Z

To be clear, I have no problem with vector_size and index_to_key.
The question is, does the naming purity outweigh the (massive) breaking of existing code and API that this will cause.

Checking "previous discussion" now. Edit: aha, you said we're already using both size and vector_size, inconsistently. So dropping one and keeping the other consistently makes sense. And you chose vector_size. OK.

Seems I'm consistent with my objections over time, at least.

gojomo · 2020-09-21T17:50:52Z

I'd also note that where it's practical & the underlying structure remains unchanged (an ordered list of lookup keys), the latest changes retain aliases for backward-compatible access:

https://github.com/RaRe-Technologies/gensim/blob/08a61e5eb6395ca4f03fa0bee9b2845f983fc3de/gensim/models/keyedvectors.py#L552-L566

Retaining size as an alias in __init__() methods would be possible but more onerous, as every method signature would then include redundant names for the same parameter, and every init method would include temporary boilerplate for using one if the other isn't present. "Ripping off the bandaid" now, at the same time as other changes, but still just the small step of a few mechanistic parameter replacements, is no worse, & probably better, than doing it later.

The biggest pain to any advanced API users is likely to be the removal of the .vocab dict, where each value was an object amenable to arbitrary expansion. The get_vecattr()/set_vecattr() mechanism offers that same potential, in a more compact & performant way, but is sufficiently different that learning/adopting the new system will require more attention to old code than just a search-and-replace names. I also expect a slightly larger 'bugtail' around that feature, as it's likely that out-in-wild patterns-of-use that we don't have covered in unit tests will exercise things in unexpected ways. But it's likely to be somewhat more capable programmers who are doing that, and the wins in generality/speed/memory-efficiency justify the one-time pain, in my opinion.

piskvorky · 2020-09-21T18:41:55Z

the latest changes retain aliases for backward-compatible access:

I actually changed this in recent commits, following your:

Perhaps index2word and index2entity should also loud-error to trigger old-code-updates as with .vocab and .vectors_norm.

They hard-fail now. I was thinking whether to keep those aliases "alive" or not… but with so many breaking changes, I also figured it is best to "rip off the bandaid" rather than introduce new deprecations (again).

The biggest pain to any advanced API users is likely to be the removal of the .vocab dict,

Can you put the rationale for reworking .vocab into https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4 ? Even if we discussed it somewhere already, in some PR, it'll be super useful to have all the changes and their back-story in one place. That Wiki page will be my starting point for all release-related write-ups, migration guide, release notes, tweets. Thanks!

piskvorky · 2020-09-21T19:17:05Z

the wins in generality/speed/memory-efficiency [of expandos] justify the one-time pain, in my opinion.

If expandos is backed by numpy arrays, won't it suffer from the same memory issues / bugs as numpy.str_ above? Numpy arrays are great for primitive types (ints, floats, etc), but not the best choice for arbitrary objects.

gojomo · 2020-09-22T05:25:07Z

If you've added hard-fails for those aliases, which I think is alright, that's in some still-pending PR.

I'll add an overview of the rationale on the wiki page.

I was unfamiliar with that limitation on string-typed numpy arrays! I'd seen object-typed numpy arrays, & had assumed Python strings were similarly flexibly handled, without fixed-length allocations. The type-imputation code should be changed to work around that, probably by using object type. The most prevalent extra attributes are simple numeric types (like ints or floats). In the other cases I can think of, they may also be ragged-sized other arrays of various types, as with HS-mode codes/points. I can't think of a current use where an actual string is needed, but the intent is that it should be supported – and I suspect both the ragged-sized arrays & strings will just best be served by using object type instead.

piskvorky · 2020-09-22T06:43:32Z

Yes, object dtype should be safer wrt memory, that stores just the references. OTOH we'll lose the type safety of "all array values are of the same type"… but I guess nobody expects that from something called expandos anyway.

If you've added hard-fails for those aliases, which I think is alright, that's in some still-pending PR.

I added them to this PR:
https://github.com/RaRe-Technologies/gensim/pull/2939/files#diff-741f69634ff2e1b2c4ce544e0d82a1e6R616

piskvorky · 2020-09-23T07:52:33Z

@mpenkov ping – please review & merge, so we can move on.

gensim/models/fasttext.py

gensim/models/keyedvectors.py

gensim/models/word2vec.py

gojomo · 2020-09-23T18:57:17Z

Happy to see the more-specific and -tangible add_vector/add_vectors renamings, & improved doc-comments!

My inline comments aren't things that should hold this up, just related concerns to discuss/address at some point.

mpenkov

Reviewed this together with @piskvorky earlier today.

In general, I think the changes are good to merge, but there are several things that could be better.

I think we should document expandos in code for the benefit of future developers and maintainers. So, add a multi-line code comment explaining what it is, what it does, and why it does things this way.

The same thing goes for specials.

It looks like the above information is available from comments in this PR (and others), but it's highly unlikely that anyone wanting to learn more about expandos and specials will know to look there.

piskvorky · 2020-09-24T10:04:43Z

My notes from our call with @mpenkov ; FYI @gojomo :

Decisions

Support loading of pre-3.8.3 models?
- Misha: better: a stand-alone script to upgrade models to the latest version… rather than class methods doing the work on each load
- Misha: also in favour of limited backward compatibility loading (just 3.8.3)
- Radim: in favour of keeping existing compatibility code; why remove it? pros/cons unclear
  => decision: talk to Gordon, make the costs (pros/cons) more explicit.
itertoolsvs it?
- prefer full name unless critical to Gordon, but w/e.
Remove subpar packages, contributed modules, wrappers?
- yes; Misha to review & axe
- what about NMF? semi-ready, but not much confidence, and its author no longer interested in finishing/maintaining.
  - options: 1. axe it; 2. invest effort & finish & fix; 3. do nothing, keep as-is => decision keep in 4.0.0, but Radim looks at option (2) later after 4.0.0

Actions

G: Finish the model serialization tickets: ?pre-3.8.3 models?, mmap, save_facebook bug
G: Finish loss tallying PR [early WIP] Fix/rationalize loss-tallying #2922? (only if feasible quickly)
R: Finish all docs:
- migration guide (wiki)
- release notes
- docstrings
- new website: integration & fix copy
- migrate & re-run tutorials and guides
- proof-read generated API docs for readability: we don't want stubs or obsolete/compatibility classes and methods like word2vec.Word2VecVocab, word2vec.Word2VecTrainables etc.
  - esp. with the new website style: correct syntax highlighting, headings, sections
- document internal code style: hanging indents, trailing commas, FIXME vs TODO, etc
M: Finish structure cleanup: remove subpackages and modules we don't want, incl. their docs
M: Release 4.0.0

@mpenkov @gojomo did I forget something?

piskvorky · 2020-09-27T10:00:12Z

I think we should document expandos in code for the benefit of future developers and maintainers. So, add a multi-line code comment explaining what it is, what it does, and why it does things this way.

@mpenkov @gojomo I improved this in 782f7ff . I didn't want to start a new PR for that, so it's a part of #2954 , where I'm fixing docs anyway.

gojomo · 2020-09-29T02:36:26Z

It's a bit awkward to discuss release todos/criteria deep in a PR that's already been merged/closed. I'd suggest creating a specific 'rollup' issue that's for meta-discussion, and in/out decisions, around the "4.0.0" release.

My main hope would be to get more eyes on the big changes via one or more "not yet for everyone" test releases, as further described in new issue #2963.

gojomo and others added 8 commits July 21, 2020 16:12

Delete .gitattributes

30af573

Merge remote-tracking branch 'upstream/develop' into develop

5c08d3e

test showing FT failure as W2V

3f7047f

set .vectors even when ngrams off

ac9126d

use _save_specials/_load_specials per type

0316084

Merge branch 'develop' into 2vec_saveload_fixes

d524fa4

docstirng fixes

49b35b7

get rid of python2 constructs

3f972a6

piskvorky added this to the 4.0.0 milestone Sep 9, 2020

piskvorky requested a review from mpenkov September 9, 2020 13:59

piskvorky mentioned this pull request Sep 9, 2020

[WIP] 2Vec SaveLoad improvements #2892

Closed

code style fixes while debugging pickle model sizes

4331ccf

Merge branch 'pickle_perambulations' into 2vec_saveload_fixes

34e77dc

mpenkov reviewed Sep 13, 2020

View reviewed changes

gensim/models/keyedvectors.py Outdated Show resolved Hide resolved

py2 to 3: get rid of forgotten range

012d598

mpenkov reviewed Sep 13, 2020

View reviewed changes

gensim/models/word2vec.py Outdated Show resolved Hide resolved

mpenkov reviewed Sep 13, 2020

View reviewed changes

gensim/models/word2vec.py Outdated Show resolved Hide resolved

fix docs

eefe9ab

get rid of numpy.str_

1a9b646

piskvorky changed the title ~~[MRG] *2Vec SaveLoad improvements~~ [WIP] *2Vec SaveLoad improvements Sep 14, 2020

piskvorky added 2 commits September 19, 2020 11:32

Merge remote-tracking branch 'origin/develop' into 2vec_saveload_fixes

c14456d

fix index2entity, fix docs, hard-fail deprecated properties

06aef75

fix nonsensical word2vec path examples

e95ac0a

piskvorky force-pushed the 2vec_saveload_fixes branch 2 times, most recently from 5509f9e to e5057c1 Compare September 20, 2020 11:29

more doc fixes

dc9c3fc

piskvorky force-pushed the 2vec_saveload_fixes branch from e5057c1 to dc9c3fc Compare September 20, 2020 12:50

gojomo reviewed Sep 23, 2020

View reviewed changes

gensim/models/fasttext.py Show resolved Hide resolved

gojomo reviewed Sep 23, 2020

View reviewed changes

gensim/models/fasttext.py Show resolved Hide resolved

gojomo reviewed Sep 23, 2020

View reviewed changes

gensim/models/keyedvectors.py Show resolved Hide resolved

gojomo reviewed Sep 23, 2020

View reviewed changes

gensim/models/word2vec.py Show resolved Hide resolved

gojomo mentioned this pull request Sep 23, 2020

[MRG] fix save_facebook_model failure after update-vocab & other initialization streamlining #2944

Merged

mpenkov reviewed Sep 24, 2020

View reviewed changes

piskvorky mentioned this pull request Sep 24, 2020

Move BrownCorpus from word2vec to gensim.corpora #2956

Open

it => itertools, + code style fixes

da8847a

piskvorky mentioned this pull request Sep 24, 2020

Clean up OOP / stub methods #2957

Open

piskvorky merged commit c6c24ea into develop Sep 24, 2020

piskvorky deleted the 2vec_saveload_fixes branch September 24, 2020 10:45

piskvorky mentioned this pull request Sep 24, 2020

Measure performance of gensim 4.0.0 vs previous versions #2887

Closed

gojomo mentioned this pull request Sep 30, 2020

Adopting a (narrow) backward-compatibility standard; implications for 4.0.0 #2967

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MRG] *2Vec SaveLoad improvements #2939

[MRG] *2Vec SaveLoad improvements #2939

piskvorky commented Sep 9, 2020 •

edited

Loading

piskvorky commented Sep 9, 2020

piskvorky commented Sep 13, 2020 •

edited

Loading

gojomo commented Sep 13, 2020 •

edited by piskvorky

Loading

piskvorky commented Sep 13, 2020

piskvorky commented Sep 13, 2020 •

edited

Loading

gojomo commented Sep 14, 2020

piskvorky commented Sep 14, 2020 •

edited

Loading

piskvorky commented Sep 14, 2020 •

edited

Loading

gojomo commented Sep 21, 2020 •

edited

Loading

piskvorky commented Sep 21, 2020 •

edited

Loading

gojomo commented Sep 21, 2020

piskvorky commented Sep 21, 2020 •

edited

Loading

piskvorky commented Sep 21, 2020 •

edited

Loading

gojomo commented Sep 22, 2020

piskvorky commented Sep 22, 2020 •

edited

Loading

piskvorky commented Sep 23, 2020

gojomo commented Sep 23, 2020 •

edited

Loading

mpenkov left a comment

piskvorky commented Sep 24, 2020 •

edited

Loading

piskvorky commented Sep 27, 2020 •

edited

Loading

gojomo commented Sep 29, 2020

[MRG] *2Vec SaveLoad improvements #2939

[MRG] *2Vec SaveLoad improvements #2939

Conversation

piskvorky commented Sep 9, 2020 • edited Loading

piskvorky commented Sep 9, 2020

piskvorky commented Sep 13, 2020 • edited Loading

gojomo commented Sep 13, 2020 • edited by piskvorky Loading

piskvorky commented Sep 13, 2020

piskvorky commented Sep 13, 2020 • edited Loading

3.8.3

4.0.0

gojomo commented Sep 14, 2020

piskvorky commented Sep 14, 2020 • edited Loading

piskvorky commented Sep 14, 2020 • edited Loading

gojomo commented Sep 21, 2020 • edited Loading

piskvorky commented Sep 21, 2020 • edited Loading

gojomo commented Sep 21, 2020

piskvorky commented Sep 21, 2020 • edited Loading

piskvorky commented Sep 21, 2020 • edited Loading

gojomo commented Sep 22, 2020

piskvorky commented Sep 22, 2020 • edited Loading

piskvorky commented Sep 23, 2020

gojomo commented Sep 23, 2020 • edited Loading

mpenkov left a comment

Choose a reason for hiding this comment

piskvorky commented Sep 24, 2020 • edited Loading

Decisions

Actions

piskvorky commented Sep 27, 2020 • edited Loading

gojomo commented Sep 29, 2020

piskvorky commented Sep 9, 2020 •

edited

Loading

piskvorky commented Sep 13, 2020 •

edited

Loading

gojomo commented Sep 13, 2020 •

edited by piskvorky

Loading

piskvorky commented Sep 13, 2020 •

edited

Loading

piskvorky commented Sep 14, 2020 •

edited

Loading

piskvorky commented Sep 14, 2020 •

edited

Loading

gojomo commented Sep 21, 2020 •

edited

Loading

piskvorky commented Sep 21, 2020 •

edited

Loading

piskvorky commented Sep 21, 2020 •

edited

Loading

piskvorky commented Sep 21, 2020 •

edited

Loading

piskvorky commented Sep 22, 2020 •

edited

Loading

gojomo commented Sep 23, 2020 •

edited

Loading

piskvorky commented Sep 24, 2020 •

edited

Loading

piskvorky commented Sep 27, 2020 •

edited

Loading