[WIP] Added function "predict_output_word" to predict the output word given the context words. Fixes issue #863. #1209
Conversation
Please add unit tests and a note in CHANGELOG.md
@tmylk Sure. Also, I wanted to confirm whether this function is meant to be implemented only for the hierarchical softmax scheme. If we are only implementing it for hierarchical softmax, then we should add a corresponding check.
Hierarchical-softmax mode is non-default, and in my experience less commonly-used. Also, this code currently interprets the individual output-slots of `syn1` as if they correspond one-to-one with vocabulary words, which isn't the case in hierarchical-softmax mode (there they correspond to interior nodes of the Huffman tree). I'd suggest instead that the negative-sampling case be clearly and properly supported – as that has the easier interpretation (a single slot in `syn1neg` per vocabulary word).
@gojomo Thanks a lot for clarifying this. So I'll change the current implementation of the function to serve the negative sampling scheme first, and then figure out how to report probabilities for the hierarchical softmax case.
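For reference, a rough sketch of what the negative-sampling prediction path could look like, assuming the attribute names that appear in the review snippets below (`wv.syn0`, `wv.vocab`, `wv.index2word`, `syn1neg`, `cbow_mean`); this is an illustration of the approach, not necessarily the code that was merged:

```python
import numpy as np

def predict_output_word_sketch(model, context_words, topn=10):
    """Sketch: CBOW-style center-word prediction under negative sampling."""
    if not model.negative:
        raise RuntimeError("Only implemented for the negative sampling scheme")

    # keep only context words that are in the model's vocabulary
    word_vocabs = [model.wv.vocab[w] for w in context_words if w in model.wv.vocab]
    if not word_vocabs:
        return None

    word2_indices = [v.index for v in word_vocabs]

    # combine the context vectors the same way CBOW training does (sum or mean)
    l1 = np.sum(model.wv.syn0[word2_indices], axis=0)
    if model.cbow_mean:
        l1 /= len(word2_indices)

    # under negative sampling each row of syn1neg corresponds to one vocabulary
    # word, so a softmax over the scores gives per-word probabilities
    scores = np.dot(l1, model.syn1neg.T)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()

    top = np.argsort(probs)[::-1][:topn]
    return [(model.wv.index2word[i], float(probs[i])) for i in top]
```

Called as `predict_output_word_sketch(model, ['context', 'words'])`, it would return `(word, probability)` pairs sorted by decreasing probability.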
gensim/models/word2vec.py
Outdated
```python
if word2_indices and self.cbow_mean:
    l1 /= len(word2_indices)

if self.negative:
```
Please raise an exception:

```python
if not self.negative:
    raise RuntimeError("We have currently only implemented for negative sampling")
```
gensim/models/word2vec.py
Outdated
```python
word2_indices.append(word.index)

l1 = np_sum(self.wv.syn0[word2_indices], axis=0)
if word2_indices and self.cbow_mean:
```
If `word_vocabs` is empty, then return None with a warning.
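I.e., something like this inside the function (a sketch; the exact warning message is illustrative):

```python
import warnings

if not word_vocabs:
    warnings.warn("All the input context words are out-of-vocabulary for the current model.")
    return None
```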
gensim/models/word2vec.py
Outdated
```python
word2_indices = []
for pos, word in enumerate(word_vocabs):
    word2_indices.append(word.index)
```
Please use a list comprehension.
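For example, equivalent to the loop above:

```python
word2_indices = [word.index for word in word_vocabs]
```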
Tests fixed by smart_open update

Please add a unit test and a note in Changelog
```
@@ -35,35 +37,35 @@ Improvements:
* Phrases and Phraser allow a generator corpus (ELind77 [#1099](https://github.com/RaRe-Technologies/gensim/pull/1099))
* Ignore DocvecsArray.doctag_syn0norm in save. Fix #789 (@accraze,[#1053](https://github.com/RaRe-Technologies/gensim/pull/1053))
* Fix bug in LsiModel that occurs when id2word is a Python 3 dictionary. (@cvangysel,[#1103](https://github.com/RaRe-Technologies/gensim/pull/1103))
* Fix broken link to paper in readme (@bhargavvader,[#1101](https://github.com/RaRe-Technologies/gensim/pull/1101))
* Lazy formatting in evaluate_word_pairs (@akutuzov,[#1084](https://github.com/RaRe-Technologies/gensim/pull/1084))
```
@tmylk please check -- or even better, introduce an automated check -- that makes sure there's no trailing whitespace in commits.
Because it then leads to confusing diffs like this one, when someone (correctly!) removes the trailing whitespace later on.
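One minimal form such an automated check could take (a sketch, not necessarily what the project adopted; the script name and invocation are hypothetical):

```python
# check_trailing_whitespace.py -- exit non-zero if any given file contains trailing whitespace
import sys

bad = []
for path in sys.argv[1:]:
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            if line.rstrip("\n") != line.rstrip():
                bad.append((path, lineno))

for path, lineno in bad:
    print("trailing whitespace: %s:%d" % (path, lineno))

sys.exit(1 if bad else 0)
```

It could be run in CI as, e.g., `python check_trailing_whitespace.py $(git ls-files '*.py')`.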
@tmylk I have made changes to CHANGELOG.md and also added a unit test as suggested by you earlier.
Thanks for the new feature. It would be good to add it to https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb
@tmylk Sure. I will update the IPython Notebook as well.
@chinmayapancholi13

```python
plan1 = ["pick-up-B", "stack-B-A", "pick-up-D", "stack-D-C"]
plan2 = ["unstack-B-A", "put-down-B", "unstack-D-C", "put-down-D"]
plan3 = ["pick-up-B", "stack-B-A", "pick-up-C", "stack-C-B", "pick-up-D", "stack-D-C"]
plan4 = ["unstack-D-C", "put-down-D", "unstack-C-B", "put-down-C", "unstack-B-A", "put-down-B"]

from gensim.models import word2vec

raw_sentences = plan1 + plan2 + plan3 + plan4
sentences = [s.split() for s in raw_sentences]
model = word2vec.Word2Vec(sentences, min_count=1, size=10, workers=4)

# pick-up-B OOO unstack-D-C put-down-D OOO stack-C-B OOO OOO
# pick-up-B stack-B-A unstack-D-C put-down-D pick-up-C stack-C-B pick-up-D stack-D-C
a = model.predict_output_word(['put-down-D', 'stack-C-B'])
print(a)
# weird???
# [('put-down-B', 0.083333336), ('stack-B-A', 0.083333336), ('unstack-C-B', 0.083333336), ('pick-up-C', 0.083333336), ('stack-C-B', 0.083333336), ('unstack-B-A', 0.083333336), ('put-down-D', 0.083333336), ('stack-D-C', 0.083333336), ('pick-up-B', 0.083333336), ('pick-up-D', 0.083333336)]
```
@exoticknight Here `raw_sentences` ends up being a flat list of individual tokens (`plan1 + plan2 + ...` concatenates the token lists), so after `s.split()` each "sentence" seen by Word2Vec is a single word. The model therefore never sees any real context during training, which is why every word gets the same probability (1/12 for the 12-word vocabulary). Each plan is already a tokenized sentence, so you can pass the plans to Word2Vec directly, as sketched below.
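A sketch of the corrected setup, reusing the plan lists from the snippet above (the `window` value is only illustrative):

```python
from gensim.models import word2vec

# each plan is already a list of tokens, i.e. one tokenized "sentence"
sentences = [plan1, plan2, plan3, plan4]
model = word2vec.Word2Vec(sentences, min_count=1, size=10, window=4, workers=1)

print(model.predict_output_word(['put-down-D', 'stack-C-B'], topn=3))
```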
@chinmayapancholi13 Thanks, that explains it.
@exoticknight No problem! Let me know if you face any other problems. I'd be happy to help. :)
@chinmayapancholi13 Hi, I checked the code and comments. From my understanding, the implementation is for CBOW. So is it right that, given 'emergency', 'beacon', 'received' (from the tutorial), the output is the center word either between 'emergency' and 'beacon' or between 'beacon' and 'received'? Because I didn't see your discussion at first, I used the implementation to predict the next word given a list of words.
@yzexeter Hey! Yes, this implementation is for CBOW, as mentioned in the original issue. This means that the list of words given as input to the function is treated as the context window, and the function reports the probabilities of possible center (target) words for that context.
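For instance, a hypothetical call, assuming `model` was trained on the tutorial corpus with `negative > 0`:

```python
# returns (word, probability) pairs for likely center words of this context
model.predict_output_word(['emergency', 'beacon', 'received'], topn=5)
```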
@chinmayapancholi13 Thank you for your explanation. As it is based on CBOW, what if I use skip-gram to train the model (sg=1)? It doesn't output a warning, so I assume it works. Does the prediction under skip-gram have another meaning, or is it still the center word of the context list?
@yzexeter In CBOW, we train our model to predict the center (target) word correctly given the context words. On the other hand, in the case of skip-gram, we train our model to predict the context words correctly given a particular word as input.
@chinmayapancholi13 Thank you. This explains my results after I used a skip-gram model to predict the output. Will this be further implemented for skip-gram as well? I will keep track of future modifications of the implementation.
@yzexeter Hey! Sorry for the late response. As I mentioned above, the input format of this function (i.e. a list of context words) doesn't really cohere with the skip-gram model (which predicts things the other way around, i.e. predicts context words given the central word). I guess there could be a separate function for the skip-gram model for doing this, but I don't have any plans right now to extend the same function to do that. :)
@chinmayapancholi13 the difference between "focus word predicts all context window words" and "all context window words are used to predict focus word" ultimately isn't that significant - in the end, all the exact same "input word -> predicted word" pairs are used for training, just in a slightly different order. (IIRC, the word2vec paper describes it one way, but the Google word2vec.c code does it the other way, because they found slightly better CPU cache utilization patterns & thus bulk performance the way the code does it.)

A skip-gram predict-word function would need the exact same context-window input - but would presumably calculate the individual predictions for every context word, then average all those predictions – even more expensive, by a factor of the number of context words.

(Either the CBOW or SG predictions should perhaps also simulate the distance-weighting that occurs during training. During training passes, windows aren't actually the full `window` words on each side: the effective window is randomly shrunk to somewhere between 1 and `window` words, which in effect weights words nearer the target more heavily.)
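For illustration, a rough sketch of that window shrinking (gensim draws the reduction with its own RNG during training; the values here are only illustrative):

```python
import numpy as np

window = 5
rng = np.random.default_rng(0)

# for each target position, training shrinks the window by a random amount,
# so words closer to the target are included in training pairs more often
reduced = int(rng.integers(0, window))   # in [0, window - 1]
effective_window = window - reduced      # in [1, window]
print(effective_window)
```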
@chinmayapancholi13 No problem. Thanks for your help. :)
This PR adds a function `predict_output_word` to the class `Word2Vec`, which runs the trained model and reports the probability values of the possible output words. This fixes #863.