[WIP] Labeled w2v #1153
Conversation
Just wondering, why wouldn't skip-gram be as appropriate as CBOW?
With CBOW we obtain a model that, given a context of words, returns a probability distribution over the vocabulary (the probability of each word appearing in that context). So we get a direct computation of a language model, while with skip-gram we do the inverse, predicting the context from a word. You would then have to compute, for each word (the labels in the fastText classifier), the probability of the specific document to classify.
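For the record, here is a toy sketch of the CBOW-style classification idea discussed above: average the input word vectors of a document, score each label with its output vector, and softmax over the labels. All names here are illustrative, not this PR's code.

```python
# Toy sketch of CBOW-style classification: mean of input word vectors,
# one output vector per label, softmax over labels.
# Vocabulary, labels, and dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(0)
vocab = {'nice': 0, 'movie': 1, 'boring': 2, 'plot': 3}
labels = {'pos': 0, 'neg': 1}
dim = 5

W_in = rng.normal(scale=0.1, size=(len(vocab), dim))    # input word vectors
W_out = rng.normal(scale=0.1, size=(len(labels), dim))  # one output vector per label

def predict_proba(doc_words):
    """P(label | document) for a bag-of-words document."""
    h = np.mean([W_in[vocab[w]] for w in doc_words], axis=0)  # CBOW: mean of word vectors
    scores = W_out @ h
    e = np.exp(scores - scores.max())
    return e / e.sum()

print(predict_proba(['nice', 'movie']))  # distribution over {'pos', 'neg'}
```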
Not sure we want to start adding supervised learning (classification) to gensim. There would have to be a really clear, convincing reason for such a major change of mission. @tmylk are you OK with this?
@giacbrd Apologies for the delay in response. I would like to include it, and the necessary refactoring is on our list for this summer's Google Summer of Code. The output of Gensim's unsupervised models only becomes useful after it is put through a supervised classifier. Training jointly with a supervised layer brings better results than training separately, as shown in fastText and supervised LDA. There is also great demand for it, as shown by fastText's popularity, the requests for supervised LDA, and the success of the ShortText package that integrates gensim with sklearn/keras.
Ping @giacbrd, what is the status of this PR?
Hi, |
@giacbrd Please reformat your docstrings according to the Google format. We also need to add a short usage example in a notebook.
@giacbrd can you summarize the use-case for this? What are the advantages of LabeledWord2Vec over FastText, or any of the classification models in scikit-learn? When would one use this class?
LabeledWord2Vec is practically very close to the original fastText classification model, but it has all the advantages of being written in Python/Cython and exposing a familiar interface. Pros:
Cons:
Given its different approach to text classification, it is a preferable alternative to many linear models; on specific data or domains it can perform substantially better.
In my opinion, we need to add several things @giacbrd:
After this, I think we can merge this PR. @piskvorky what do you think about it?
@menshikh-iv yes, I was waiting for a confirmation; I see there are still doubts about the eligibility of this model for Gensim. I mean, maybe you don't want to introduce a text classification algorithm into the library? I am writing a notebook for https://github.com/giacbrd/ShallowLearn, which is a layer (a scikit-learn interface) over LabeledWord2Vec, reproducing the official fastText tutorial (https://github.com/facebookresearch/fastText/blob/master/tutorials/supervised-learning.md) and highlighting the additional features. It could also be suitable as a notebook for Gensim...
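For context, ShallowLearn wraps this model in a scikit-learn-style interface; a hedged sketch of what usage looks like is below. The class name, import path, and parameters are assumptions based on that style, so check the project's README for the actual API.

```python
# Hedged sketch of ShallowLearn-style usage (scikit-learn interface over
# LabeledWord2Vec). Import path, class name, and parameters are assumed,
# not verified against the library.
from shallowlearn.models import GensimFastText  # assumed import path

clf = GensimFastText(size=100, min_count=0, loss='hs', iter=3, seed=66)
clf.fit([('i', 'am', 'tall'), ('you', 'are', 'fat')], ['yes', 'no'])
print(clf.predict([('tall', 'am', 'i')]))  # e.g. ['yes']
```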
wdyt @piskvorky? |
@piskvorky what do you think about this PR? Refactoring and optimization are part of my GSoC timeline, which starts today. It would be better to know your vision for this project.
@prakhar2b I'd much prefer to have "native" fastText in gensim (Python/C/Cython) first (currently we only have a wrapper for the C++ code). That's an unsupervised algorithm, perfectly in line with gensim's mission (unlike supervised classification). In addition, fastText is a cool, useful algorithm. But I don't know how much leeway there is to change your GSoC topic. Or are these two tasks related? How much overlap is there? Any chance to do both at once? Also, what is the connection to @giacbrd's existing work? Will you two work on this together? Or what's the difference?
The unsupervised models of fastText are the ones described here: https://arxiv.org/abs/1607.04606. LabeledWord2Vec instead refers only to https://arxiv.org/abs/1607.01759, which is a supervised model that also exploits the "tricks" of the previous article. However, in LabeledWord2Vec I have not implemented all these tricks, i.e., subword n-grams and the hashing trick.

In fact, these should be implemented as generic features in Gensim, at vocabulary construction time. Subword n-grams and the hashing trick could be used by any word-vector-space based method in Gensim (just like the phrases https://radimrehurek.com/gensim/models/phrases.html). By using them with the current implementation of word2vec in Gensim, we would practically obtain the fastText unsupervised models!

My opinion is that, if we want the fastText unsupervised models in Gensim, word2vec should be improved following https://arxiv.org/abs/1607.04606 and its related code. A refactoring is necessary, along with an improved design of the Word2Vec class; e.g. the word vocabulary should be able to work with subwords, word hashes, word n-grams, ... (OOV words in general). LabeledWord2Vec is a relatively simple modification of Word2Vec that uses a different vocabulary in the output layer of the network (a set of labels instead of the text words), in order to perform text classification.
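To illustrate the two "tricks" mentioned above, here is a small sketch of fastText-style character n-grams plus the hashing trick. It is only illustrative: the real fastText uses an FNV-1a hash and its own bucket count, while this sketch uses Python's built-in `hash` for brevity.

```python
# Illustrative sketch of subword n-grams and the hashing trick.
# Not the actual fastText or gensim implementation.

def char_ngrams(word, n_min=3, n_max=6):
    """fastText-style subword n-grams, with < and > as word boundary markers."""
    w = '<' + word + '>'
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def hashed_ids(ngrams, buckets=2_000_000):
    """Hashing trick: map each n-gram to one of a fixed number of buckets.
    Python's built-in hash is salted per run; fastText uses FNV-1a instead."""
    return [hash(ng) % buckets for ng in ngrams]

print(char_ngrams('where'))                   # ['<wh', 'whe', 'her', 'ere', 're>', ...]
print(hashed_ids(char_ngrams('where'))[:3])   # three bucket ids
```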
@piskvorky As for the overlap with this PR, there is definitely overlap in the training code between word2vec, unsupervised fastText, and LabeledWord2Vec. I think it would be better to draw an outline of what features we want (LabeledWord2Vec is not Facebook's complete fastText supervised classification), keeping all three in mind. Btw, I see you are quite reluctant to add supervised classification to gensim, any specific reason for that? cc @jayantj
In my opinion, it's a nice feature for a future 'contribute' subpackage. @giacbrd are you planning to finish this PR?
I don't plan to continue working on this; it seems there is no actual interest in supervised learning. Maybe after @prakhar2b's work and a general refactoring of these models it will be worth adding LabeledWord2Vec, which is just:
Maybe I don't understand the purpose of finalizing this PR; will it be merged into develop? As an "external contribution" it is already available here: https://github.com/giacbrd/ShallowLearn
So, first we should finish fastText & refactor the "common w2v code", and after that finish this PR and add it to the 'contribute' subpackage. Thanks for your work @giacbrd, I'll ping you when we are ready for it.
I have added a new class in the module labeledword2vec: LabeledWord2Vec.
The goal of this class is already described in #960
It is a subclass of Word2Vec. Direct subclassing is not the optimal solution here: it would be preferable to have a base class, something like ShallowNeuralNetwork, with subclasses LabeledWord2Vec and Word2Vec. They both share the two-layer neural network concept, but the small differences make them two totally different instruments.
I preferred to minimize my intrusion into Gensim, avoiding refactoring a lot of stuff; the solution of a more complex class hierarchy did not seem trivial.
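To make the proposed hierarchy concrete, here is a minimal sketch of the base-class idea: a shared shallow network whose output vocabulary is either the words themselves (word2vec) or a set of labels (classification). Every name is illustrative and assumes nothing about this PR's actual code.

```python
# Sketch of the ShallowNeuralNetwork idea: a two-layer network with
# input vectors over words and output vectors over an output vocabulary.
# All class and attribute names are illustrative.
import numpy as np

class ShallowNeuralNetwork:
    """Input vectors over words; output vectors over an output vocabulary
    (the words themselves for word2vec, the labels for classification)."""
    def __init__(self, input_vocab, output_vocab, dim=100, seed=0):
        rng = np.random.default_rng(seed)
        self.input_vocab = {w: i for i, w in enumerate(input_vocab)}
        self.output_vocab = {w: i for i, w in enumerate(output_vocab)}
        self.syn0 = rng.normal(scale=0.1, size=(len(input_vocab), dim))  # input layer
        self.syn1 = np.zeros((len(output_vocab), dim))                   # output layer

class Word2VecLike(ShallowNeuralNetwork):
    def __init__(self, words, **kw):
        # output vocabulary == input vocabulary: predict words from words
        super().__init__(words, words, **kw)

class LabeledWord2VecLike(ShallowNeuralNetwork):
    def __init__(self, words, labels, **kw):
        # output vocabulary is the label set: predict labels from words
        super().__init__(words, labels, **kw)
```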