Fixes issues while loading word2vec and doc2vec models saved using old Gensim versions. Fix #2000, #1977 #2012

Merged: 13 commits, Apr 12, 2018

manneshiva
Contributor

This PR addresses #2000 and #1977. The issues were caused by a few attributes (such as min_alpha_yet_reached and running_training_loss) that are missing from models saved with very old Gensim versions. I have added tests that load a word2vec model saved with Gensim 0.12.0 and a doc2vec model saved with Gensim 0.13.0. The tests also check online training and a similarity search after loading these old models.
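
The general idea of tolerating attributes that an old saved object lacks can be sketched as follows. This is a minimal illustration, not the code from this PR: `backfill_missing_attributes`, `OldModel`, and the default values are all hypothetical, and only the two attribute names come from the description above.

```python
def backfill_missing_attributes(model, defaults):
    """Set each default only if the loaded object lacks that attribute.

    Returns the names that were added, which is handy for logging.
    """
    added = []
    for name, value in defaults.items():
        if not hasattr(model, name):
            setattr(model, name, value)
            added.append(name)
    return added


class OldModel:
    """Hypothetical stand-in for an object restored from an old file."""

    def __init__(self):
        self.alpha = 0.025  # present in the old file, must not be overwritten


old = OldModel()
# The default values below are assumptions chosen for illustration only.
added = backfill_missing_attributes(old, {
    "alpha": 0.05,
    "min_alpha_yet_reached": old.alpha,
    "running_training_loss": 0.0,
})
```

Existing values are left alone, so the same helper can be applied to files from any version without checking the version first.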

Contributor

menshikh-iv commented Apr 2, 2018

Plan:

  • Add toy models for each version (before 3.4.0), and add tests for them too.
  • Fix any additional errors (if they occur).

We need to cover as many situations as possible, because this kind of problem is already starting to cause trouble.

saved_models_dir = datapath('old_w2v_models')
for old_version in old_versions:
    model = word2vec.Word2Vec.load(os.path.join(saved_models_dir, 'w2v_{}.mdl'.format(old_version)))
    self.assertTrue(len(model.wv.vocab) == 3)
Contributor

Add most_similar + update the model (same for d2v).

    new_model.docvecs.max_rawint = old_model.docvecs.__dict__.get('max_rawint')
    new_model.docvecs.offset2doctag = old_model.docvecs.__dict__.get('offset2doctag')
else:
    new_model.docvecs.max_rawint = len(old_model.docvecs.index2doctag) if old_model.docvecs.index2doctag else old_model.docvecs.count
    new_model.docvecs.max_rawint = \
Owner

Magic: definitely deserves a comment.

doc0_inferred = model.infer_vector(list(DocsLeeCorpus())[0].words)
sims_to_infer = model.docvecs.most_similar([doc0_inferred], topn=len(model.docvecs))
self.assertTrue(sims_to_infer)

Contributor

menshikh-iv commented Apr 9, 2018

Can you add save + load + infer_vector here (to be 100% sure that the model persists correctly)? Make sure you use the /tmp directory; check gensim.test.utils, where you'll find the functions you need (same for w2v).

Also, please try updating the model (as for w2v).
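
The shape of such a save + load + infer check is sketched below. It is a toy under stated assumptions: the "model" is a plain dict, `infer_vector` here is a hypothetical averaging function rather than Doc2Vec inference, and the standard-library `tempfile` module stands in for the helpers in gensim.test.utils.

```python
import os
import pickle
import tempfile


def infer_vector(model, words):
    """Toy inference: average the vectors of known words (not Doc2Vec's algorithm)."""
    known = [model[w] for w in words if w in model]
    dim = len(next(iter(model.values())))
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def roundtrip(model, directory):
    """Save the model inside `directory`, then load it back from disk."""
    path = os.path.join(directory, "toy_model.pkl")
    with open(path, "wb") as fout:
        pickle.dump(model, fout)
    with open(path, "rb") as fin:
        return pickle.load(fin)


toy_model = {"human": [1.0, 0.0], "computer": [0.0, 1.0]}

# The temporary directory is removed automatically, so the test leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    reloaded = roundtrip(toy_model, tmp_dir)

before = infer_vector(toy_model, ["human", "computer"])
after = infer_vector(reloaded, ["human", "computer"])
```

Comparing inference before and after the round trip catches attributes that load correctly from an old file but are then dropped or mangled when saved again.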

'3.0.0', '3.1.0', '3.2.0', '3.3.0', '3.4.0'
]

saved_models_dir = datapath('old_d2v_models')
Contributor

menshikh-iv commented Apr 9, 2018

Better: use datapath('old_d2v_models/d2v_{}.mdl') and format it later.
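
A minimal sketch of that suggestion: resolve the path template once, then fill in the version inside the loop. The `datapath` function below is a hypothetical stand-in that joins onto an assumed `test_data` directory; it is not the real test helper.

```python
import os


def datapath(fname):
    """Hypothetical stand-in for the test helper; the base directory is assumed."""
    return os.path.join("test_data", fname)


# Resolve the template once; the '{}' placeholder survives until format() is called.
model_path_template = datapath("old_d2v_models/d2v_{}.mdl")

old_versions = ["3.3.0", "3.4.0"]
model_paths = [model_path_template.format(version) for version in old_versions]
```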

menshikh-iv changed the title from "Fixes issues while loading word2vec and doc2vec models saved using old Gensim versions." to "Fixes issues while loading word2vec and doc2vec models saved using old Gensim versions. Fix #2000, #1977" on Apr 10, 2018
menshikh-iv merged commit 2024be9 into piskvorky:develop on Apr 12, 2018
darindf pushed a commit to darindf/gensim that referenced this pull request Apr 23, 2018
…g old Gensim versions. Fix piskvorky#2000, piskvorky#1977 (piskvorky#2012)

* adds default values for attributes

* ignore values for attributes that do not exist

* adds unit test

* fixes default values for missing attributes for older gensim models

* adds unit test cases for loading really old gensim models

* adds test cases for loading all old models

* adds more tests post loading

* handles loading d2v models saved using version <=0.12.2

* fix `max_rawint` value and PEP8 errors

* adds saving and loading back tests

* adds comments and fixes `max_rawint`

* fix PEP8