Fixes issues while loading word2vec and doc2vec models saved using old Gensim versions. Fix #2000, #1977 (#2012)
Plan:
We need to cover as many situations as possible, because this kind of problem is already becoming a nuisance.
gensim/test/test_word2vec.py
Outdated
saved_models_dir = datapath('old_w2v_models')
for old_version in old_versions:
    model = word2vec.Word2Vec.load(os.path.join(saved_models_dir, 'w2v_{}.mdl'.format(old_version)))
    self.assertTrue(len(model.wv.vocab) == 3)
Add a most_similar check and also update the model after loading (do the same for d2v).
gensim/models/deprecated/doc2vec.py
Outdated
    new_model.docvecs.max_rawint = old_model.docvecs.__dict__.get('max_rawint')
    new_model.docvecs.offset2doctag = old_model.docvecs.__dict__.get('offset2doctag')
else:
    new_model.docvecs.max_rawint = len(old_model.docvecs.index2doctag) if old_model.docvecs.index2doctag else old_model.docvecs.count
    new_model.docvecs.max_rawint = \
Magic: definitely deserves a comment.
gensim/test/test_doc2vec.py
Outdated
doc0_inferred = model.infer_vector(list(DocsLeeCorpus())[0].words)
sims_to_infer = model.docvecs.most_similar([doc0_inferred], topn=len(model.docvecs))
self.assertTrue(sims_to_infer)
Can you add save + load + infer_vector here (to be 100% sure that the model persists correctly)? Make sure that you use the /tmp directory; check gensim.test.utils, you'll find the needed functions there (and the same for w2v).
Also, please try to update the model (as for w2v).
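The round trip requested above can be sketched with the standard library alone. This is only an illustration of the pattern (save into a temporary directory, load back, compare inference results): `ToyModel`, its JSON format, and `roundtrip` are hypothetical stand-ins, not Gensim's Doc2Vec or the helpers in gensim.test.utils.

```python
import json
import os
import tempfile


class ToyModel:
    """Toy stand-in for a trained model; not Gensim's Doc2Vec."""

    def __init__(self, weights):
        self.weights = dict(weights)

    def infer_vector(self, words):
        # Deterministic toy "inference": sum the weights of known words.
        return [sum(self.weights.get(w, 0.0) for w in words)]

    def save(self, path):
        with open(path, "w") as fout:
            json.dump(self.weights, fout)

    @classmethod
    def load(cls, path):
        with open(path) as fin:
            return cls(json.load(fin))


def roundtrip(model):
    """Save the model under a temporary directory and load it back."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "toy_model.json")
        model.save(path)
        return ToyModel.load(path)


model = ToyModel({"human": 0.5, "computer": 0.25})
reloaded = roundtrip(model)
words = ["human", "computer", "unknown"]
# The reloaded model should infer exactly what the original did.
assert reloaded.infer_vector(words) == model.infer_vector(words)
```

In the real tests the same shape would presumably apply to the loaded old model, with the temporary path coming from the test utilities rather than from `tempfile` directly.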
gensim/test/test_doc2vec.py
Outdated
    '3.0.0', '3.1.0', '3.2.0', '3.3.0', '3.4.0'
]

saved_models_dir = datapath('old_d2v_models')
Better: datapath('old_d2v_models/d2v_{}.mdl'), and format later.
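A minimal sketch of that suggestion: resolve the path template once and fill in the version inside the loop. To keep the snippet self-contained, `datapath` here is a hypothetical stand-in for the helper in gensim.test.utils, and the `TEST_DATA_DIR` value is an assumption; the version list is the one from the hunk above.

```python
import os

# Hypothetical stand-in for gensim.test.utils.datapath: joins a file name
# onto a test-data directory. The directory used here is an assumption.
TEST_DATA_DIR = os.path.join("gensim", "test", "test_data")


def datapath(fname):
    return os.path.join(TEST_DATA_DIR, fname)


old_versions = ['3.0.0', '3.1.0', '3.2.0', '3.3.0', '3.4.0']

# Build the template once, then format it per version.
model_path_template = datapath('old_d2v_models/d2v_{}.mdl')
model_paths = [model_path_template.format(version) for version in old_versions]
```

This removes the separate `saved_models_dir` variable and the `os.path.join` call inside the loop.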
…g old Gensim versions. Fix piskvorky#2000, piskvorky#1977 (piskvorky#2012)
* adds default values for attributes
* ignore values for attributes that do not exist
* adds unit test
* fixes default values for missing attributes for older gensim models
* adds unit test cases for loading really old gensim models
* adds test cases for loading all old models
* adds more tests post loading
* handles loading d2v models saved using version <=0.12.2
* fix `max_rawint` value and PEP8 errors
* adds saving and loading back tests
* adds comments and fixes `max_rawint`
* fix PEP8
This PR addresses #2000 and #1977. The issues were caused by a few attributes (like `min_alpha_yet_reached` and `running_training_loss`) that are missing from models saved with really old Gensim versions. I have added tests that load a word2vec model saved using Gensim version 0.12.0 and a doc2vec model saved using Gensim 0.13.0. The tests also check online training and a similarity search after loading these old models.
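The general idea of filling in missing attributes on load can be sketched as follows. This is not the code from the PR: `upgrade_state` is a hypothetical helper operating on a plain dict of saved attributes, and the fallback values (the saved `alpha` for `min_alpha_yet_reached`, `0.0` for `running_training_loss`) are assumptions chosen for illustration.

```python
def upgrade_state(state):
    """Return a copy of an old model's saved attributes with defaults
    filled in for attributes that older versions never stored.

    Hypothetical helper; the default values below are assumptions.
    """
    upgraded = dict(state)  # leave the caller's dict untouched
    # Fall back to the saved learning rate if the attribute is absent.
    upgraded.setdefault(
        "min_alpha_yet_reached", float(upgraded.get("alpha", 0.025))
    )
    # A model that never tracked loss starts from zero.
    upgraded.setdefault("running_training_loss", 0.0)
    return upgraded


# An "old" state that predates both attributes.
old_state = {"alpha": 0.05}
new_state = upgrade_state(old_state)
```

Using `setdefault` means values that an old model did save are never overwritten; only the gaps are filled.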