Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Implement save_word2vec_format can for Doc2Vec #699

Closed
wants to merge 1 commit into from

Conversation

cedias
Copy link

@cedias cedias commented May 12, 2016

Hi,
I recently had to export my Doc2Vec model to W2V base format and the function save_word2vec_format wasn't implemented for D2V class and was simply calling W2V one.

Therefore I quickly made this implementation.

If you believe it's worthwhile I'll go ahead and implement the load function to properly test the save/load pipeline, let me know.

@tmylk
Copy link
Contributor

tmylk commented Jun 9, 2016

@cedias Apologies for late reply. What will be the advantage of saving in this format? What is your use case?

@cedias
Copy link
Author

cedias commented Jun 9, 2016

Well, since Doc2Vec extends Word2Vec, the save_word2vec_format function can already be used to output the word vectors. I believe it's logical to be able to output the document vector as well (At least that is what I was expecting before noticing it was only writing word vectors).

For the save_word2vec_format function the main (and probably only) advantage of this format is legacy. Personnally, I had some code which took classic w2v binary output as input and I wanted to try and input document vectors.

As for the load function i'm not really sure about the use case, besides API consistency, which is why I wanted an opinion about it :)

@gojomo
Copy link
Collaborator

gojomo commented Jun 9, 2016

I can see this being useful for others as well. Some thoughts:

  • should remain possible to also save word-vecs – perhaps even to the same file? – so superclass functionality shouldn't be completely hidden by the override
  • should also support case where user's doctags are just plain int indexes rather than strings (and thus model.docvecs.doctags is empty), or a mixture of ints and strings
  • load... makes sense as well, subject to the same above points

Achieving those may require some new conventions in the method API and on-disk format to indicate the word-vec/doc-vec distinction... if at all possible those conventions should put a minimal burden on people using older files, other tools, or just one set (word-vecs or doc-vecs) of vectors.

@cedias
Copy link
Author

cedias commented Jun 13, 2016

Ok, i'll work on it later this month then. Thanks for the tips.

@tmylk
Copy link
Contributor

tmylk commented Aug 14, 2016

Hi @cedias Would you have time to work on this for our release this month?

@tmylk tmylk added feature Issue described a new feature difficulty hard Hard issue: required deep gensim understanding & high python/cython skills labels Oct 4, 2016
@cedias
Copy link
Author

cedias commented Jan 27, 2017

I believe this feature will be removed/realized in #1107

@cedias cedias closed this Jan 27, 2017
@tmylk
Copy link
Contributor

tmylk commented Jan 27, 2017

@cedias Thanks for following the github development.
The feature that you are proposing is a new feature and is not included in #1107.
That PR doesn't touch on model.docvecs and existing behaviour of simply calling Word2Vec.save_word2vec_format is preserved.

@gojomo
Copy link
Collaborator

gojomo commented Jan 27, 2017

If Doc2Vec's DocvecsArray is fully adapted to use KeyedVectors, it might then inherit a useful save_word2vec_format() implementation.

Plausibly as per my earlier comment, there could be key-munging conventions for mixing word & doc vectors into the same flattened file on save, or even disentangling them on load. (Some downstream applications might like them mixed-together.)

@parulsethi
Copy link
Contributor

Another use case - I wanted to visualize docvecs in Tensorboard which require vectors to be in text file format, and this functionality would be useful for that.

@menshikh-iv
Copy link
Contributor

Ping @cedias, what status of this PR? Will you finish it soon?

@parulsethi
Copy link
Contributor

@menshikh-iv This feature was added in #1256.

@menshikh-iv
Copy link
Contributor

Then I close this PR, I hope the author agrees with @parulsethi (If not, you can reopen it)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
difficulty hard Hard issue: required deep gensim understanding & high python/cython skills feature Issue described a new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants