Text classification workflow #1025
Conversation
Force-pushed from dd7153f to edd781d (compare)
@gheinrich, in encode_entry(), if one char is encoded as k and another char is encoded as k+1, does that imply these two chars are close to each other? I mean, encoding characters into scalars, rather than one-hot encoding, seems uncommon. For text classification, word2vec seems a more popular way to encode documents into a vector space.
Hi @IsaacYangSLA, in the example network that I provide, the first layer is doing one-hot encoding of the characters. I chose to do the one-hot encoding in the network rather than in the dataset because that results in a much more compact dataset, especially if you have a large alphabet. It's just my opinion, but I think "word2vec" kind of defeats the purpose of deep learning: you need logic outside of the network, like stemming algorithms, to identify words. I suppose the popularity of word2vec comes down to limited memory/compute capabilities. I think a character-level representation of the data should ultimately be more powerful, similar to how Deep Neural Nets outperform HOG+SVM in image classification.
Hi @gheinrich, thanks for the information about the first layer. That's a better design, I agree. As for the word2vec-vs-character question, it seems more people use word2vec in NLP applications, and the idea behind it is also reasonable, i.e. the concept of DC - USA + France ~= Paris. However, I see increasing research on character-based text processing. Maybe in a few years it will outperform word2vec in NLP applications.
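The trade-off discussed above can be sketched as follows. This is a minimal illustration, not the actual DIGITS plugin code: the alphabet, the padding convention, and the `one_hot` helper are assumptions; only the name `encode_entry` comes from the conversation. The dataset stores one small integer per character, and the network's first layer expands those indices into one-hot vectors.

```python
import numpy as np

# Hypothetical alphabet; the real plugin's alphabet may differ.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def encode_entry(text, length=16):
    """Compact dataset representation: one integer index per character.
    Index 0 is reserved for unknown characters and padding."""
    indices = np.zeros(length, dtype=np.int64)
    for i, ch in enumerate(text[:length]):
        indices[i] = ALPHABET.find(ch) + 1  # find() returns -1 -> 0 (unknown)
    return indices

def one_hot(indices, vocab_size=len(ALPHABET) + 1):
    """What the network's first layer would do: expand indices to one-hot."""
    out = np.zeros((len(indices), vocab_size), dtype=np.float32)
    out[np.arange(len(indices)), indices] = 1.0
    return out

idx = encode_entry("hello")   # 16 integers
vec = one_hot(idx)            # 16 x 28 floats
# Storing idx needs `length` integers; storing vec would need
# length * vocab_size floats -- hence the more compact dataset.
```

With a large alphabet the saving grows linearly: the dataset size is independent of the vocabulary size, while a pre-expanded one-hot dataset is not.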
Force-pushed from edd781d to 74039af (compare)
Rebased on tip of master branch.
author="Greg Heinrich",
description=("A data ingestion plugin for text classification"),
long_description=read('README'),
license="Apache",
Why Apache and not BSD-3?
I didn't really think about it. Good point, I'll use the same license as the top-level setup.py.
Done on latest commit.
scores = output_data[output_data.keys()[0]].astype('float32')

if self.terminal_layer_type == "logsoftmax":
    scores = np.exp(scores)
Could you just check to see if the values sum to 1 instead of having this form field?
Good idea (I guess I can also check whether the values are positive or negative)!
Done on latest commit.
I ran through the text classification example using these plugins and it worked great. I'd like to see the logsoftmax thing removed before merge, since that will make this a bit easier to use.
Force-pushed from 74039af to f8447b0 (compare)
Force-pushed from f8447b0 to e7745fa (compare)
I have updated the text classification example to show how to use the plug-ins.
if np.max(scores) < 0:
    # terminal layer is a logsoftmax
    scores = np.exp(scores)
👍
```sh
$ pip install $DIGITS_ROOT/plugins/data/textClassification
$ pip install $DIGITS_ROOT/plugins/view/textClassification
```
👍
Merged branch …-workflow (Text classification workflow)
Depends on #927 (support for plug-ins) and #1024 (modularization of inference).