[WIP] Implement save_word2vec_format can for Doc2Vec #699

cedias · 2016-05-12T12:08:19Z

Hi,
I recently had to export my Doc2Vec model to the base word2vec format, but save_word2vec_format wasn't implemented for the Doc2Vec class; it simply called the Word2Vec implementation.

Therefore I quickly made this implementation.

If you believe it's worthwhile, let me know and I'll go ahead and implement the load function to properly test the save/load pipeline.

tmylk · 2016-06-09T15:19:56Z

@cedias Apologies for the late reply. What would be the advantage of saving in this format? What is your use case?

cedias · 2016-06-09T15:42:50Z

Well, since Doc2Vec extends Word2Vec, the save_word2vec_format function can already be used to output the word vectors. I believe it's logical to be able to output the document vectors as well (at least that is what I was expecting before noticing it was only writing word vectors).

For the save_word2vec_format function, the main (and probably only) advantage of this format is legacy compatibility. Personally, I had some code that took classic w2v binary output as input, and I wanted to try feeding it document vectors.

As for the load function, I'm not really sure about the use case besides API consistency, which is why I wanted an opinion about it :)
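For readers unfamiliar with the format under discussion, here is a minimal sketch of the word2vec text layout applied to document vectors: a header line with the vector count and dimensionality, then one line per key followed by its components. It is plain Python and does not use gensim; the helper names `write_word2vec_text` and `read_word2vec_text` and the sample tags are hypothetical, not part of this PR.

```python
import io


def write_word2vec_text(vectors, out):
    """Write {key: [floats]} to `out` in the word2vec text layout.

    Keys must not contain spaces, because a space separates the key
    from the vector components on each line.
    """
    items = list(vectors.items())
    dim = len(items[0][1]) if items else 0
    out.write(f"{len(items)} {dim}\n")  # header: "<count> <dimensions>"
    for key, vec in items:
        if len(vec) != dim:
            raise ValueError(f"vector for {key!r} has {len(vec)} dims, expected {dim}")
        out.write(f"{key} " + " ".join(repr(float(x)) for x in vec) + "\n")


def read_word2vec_text(inp):
    """Read the layout written above back into {key: [floats]}."""
    count, dim = map(int, inp.readline().split())
    vectors = {}
    for _ in range(count):
        parts = inp.readline().rstrip("\n").split(" ")
        if len(parts) != dim + 1:
            raise ValueError("malformed or truncated vector line")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Hypothetical document vectors keyed by string doctags.
doc_vectors = {"doc_0": [0.1, -0.2, 0.3], "doc_1": [0.4, 0.5, -0.6]}
buf = io.StringIO()
write_word2vec_text(doc_vectors, buf)
buf.seek(0)
restored = read_word2vec_text(buf)
```

Because the file carries nothing but keys and vectors, any tool that already consumes word vectors in this layout can consume document vectors the same way, which is the legacy argument made above.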

gojomo · 2016-06-09T16:07:14Z

I can see this being useful for others as well. Some thoughts:

- it should remain possible to also save word-vecs – perhaps even to the same file? – so superclass functionality shouldn't be completely hidden by the override
- it should also support the case where the user's doctags are just plain int indexes rather than strings (and thus model.docvecs.doctags is empty), or a mixture of ints and strings
- load... makes sense as well, subject to the same points above

Achieving those may require some new conventions in the method API and on-disk format to indicate the word-vec/doc-vec distinction... if at all possible those conventions should put a minimal burden on people using older files, other tools, or just one set (word-vecs or doc-vecs) of vectors.
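One way to picture such a convention is a key prefix that marks doc-vec entries in a flattened file, so that word-only files stay unchanged. The sketch below is an illustration under assumptions, not the API of this PR: the prefix `*dt_`, and the helpers `flatten` and `unflatten`, are hypothetical names chosen for the example.

```python
DOCTAG_PREFIX = "*dt_"  # hypothetical marker for doc-vec keys


def flatten(word_vecs, doc_vecs, prefix=DOCTAG_PREFIX):
    """Merge word and doc vectors into one {key: vector} mapping.

    Word keys are stored as-is, so a file holding only word vectors looks
    exactly like an ordinary word-vector file. Doctags (ints or strings)
    are converted to strings and prefixed.
    """
    flat = {}
    for word, vec in word_vecs.items():
        if word.startswith(prefix):
            raise ValueError(f"word {word!r} collides with the doctag prefix")
        flat[word] = vec
    for tag, vec in doc_vecs.items():
        flat[f"{prefix}{tag}"] = vec
    return flat


def unflatten(flat, prefix=DOCTAG_PREFIX):
    """Split a flattened mapping back into (word_vecs, doc_vecs).

    Doctags come back as strings: the difference between the int tag 12
    and the string tag "12" is not recoverable from the key alone.
    """
    word_vecs, doc_vecs = {}, {}
    for key, vec in flat.items():
        if key.startswith(prefix):
            doc_vecs[key[len(prefix):]] = vec
        else:
            word_vecs[key] = vec
    return word_vecs, doc_vecs


# Hypothetical data: one word, one int doctag, one string doctag.
words = {"cat": [0.1, 0.2]}
docs = {0: [0.3, 0.4], "SENT_1": [0.5, 0.6]}
flat = flatten(words, docs)
words_back, docs_back = unflatten(flat)
```

The trade-off in this sketch is that consumers who only want word vectors must skip or tolerate the prefixed keys, while consumers of older word-only files need no changes at all.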

cedias · 2016-06-13T06:36:05Z

OK, I'll work on it later this month then. Thanks for the tips.

tmylk · 2016-08-14T19:14:27Z

Hi @cedias, would you have time to work on this for our release this month?

cedias · 2017-01-27T09:22:16Z

I believe this feature will be removed/realized in #1107

tmylk · 2017-01-27T15:42:18Z

@cedias Thanks for following the GitHub development.
The feature that you are proposing is a new feature and is not included in #1107.
That PR doesn't touch model.docvecs, and the existing behaviour of simply calling Word2Vec.save_word2vec_format is preserved.

gojomo · 2017-01-27T23:13:49Z

If Doc2Vec's DocvecsArray is fully adapted to use KeyedVectors, it might then inherit a useful save_word2vec_format() implementation.

Plausibly, as per my earlier comment, there could be key-munging conventions for mixing word & doc vectors into the same flattened file on save, or even disentangling them on load. (Some downstream applications might like them mixed together.)

parulsethi · 2017-04-01T19:45:38Z

Another use case: I wanted to visualize docvecs in TensorBoard, which requires vectors to be in a text file format, and this functionality would be useful for that.
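As a rough illustration of that use case, the sketch below writes document vectors as tab-separated values with a parallel metadata file of labels, which is the kind of plain-text input the embedding projector can load. It is plain Python and independent of gensim; the helper name `write_projector_tsv` and the sample tags are hypothetical.

```python
import io


def write_projector_tsv(doc_vecs, vectors_out, metadata_out):
    """Write one tab-separated vector per line, plus one label per line.

    Line i of the metadata output labels line i of the vectors output.
    A single-column metadata file is written without a header row.
    """
    for tag, vec in doc_vecs.items():
        vectors_out.write("\t".join(str(float(x)) for x in vec) + "\n")
        metadata_out.write(f"{tag}\n")


# Hypothetical document vectors; real ones would come from a trained model.
doc_vecs = {"doc_0": [0.1, -0.2], "doc_1": [0.3, 0.4]}
vectors_buf, metadata_buf = io.StringIO(), io.StringIO()
write_projector_tsv(doc_vecs, vectors_buf, metadata_buf)
```

With real files in place of the `StringIO` buffers, the two outputs can be loaded together so that each point is shown with its doctag.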

menshikh-iv · 2017-06-13T08:52:41Z

Ping @cedias, what is the status of this PR? Will you finish it soon?

parulsethi · 2017-06-13T16:53:00Z

@menshikh-iv This feature was added in #1256.

menshikh-iv · 2017-06-13T18:14:44Z

Then I'll close this PR. I hope the author agrees with @parulsethi (if not, you can reopen it).

save_word2vec_format can save documents vectors

748992a

tmylk added the "feature" (Issue described a new feature) and "difficulty hard" (Hard issue: required deep gensim understanding & high python/cython skills) labels Oct 4, 2016

cedias closed this Jan 27, 2017

tmylk reopened this Jan 27, 2017

tmylk mentioned this pull request Jan 27, 2017

Doc2Vec .save_word2vec_format() doesn't save everything. #1110

Closed

parulsethi mentioned this pull request Apr 1, 2017

Added save method for doc2vec #1256

Merged

menshikh-iv closed this Jun 13, 2017
