Text classification workflow #1025

Merged
merged 3 commits into from
Oct 25, 2016

Conversation

gheinrich
Contributor

@gheinrich gheinrich commented Aug 31, 2016

Depends on #927 (support for plug-ins) and #1024 (modularization of inference).

@IsaacYangSLA
Contributor

@gheinrich, in encode_entry(), if one character is encoded as k and another as k+1, does that imply the two characters are close to each other? Encoding characters as scalars, rather than one-hot encoding them, seems uncommon. For text classification, word2vec seems to be a more popular way to encode documents into a vector space.

@gheinrich
Contributor Author

Hi @IsaacYangSLA, in the example network that I provide, the first layer is doing one-hot encoding of the characters. I chose to do the one-hot encoding in the network rather than in the dataset because that results in a much more compact dataset, especially if you have a large alphabet.

It's just my opinion, but I think "word2vec" kind of defeats the purpose of deep learning: you need logic outside of the network, like stemming algorithms, to identify words. I suppose the popularity of word2vec comes down to the limited memory/compute capabilities. I think a character-level representation of the data should be ultimately more powerful, similar to how Deep Neural Nets outperform HOG+SVM in image classification.
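
The design choice described above (store characters as integer indices, then one-hot encode them in the first layer) can be illustrated with a minimal NumPy sketch. This is not the plugin's actual code: the alphabet, the function names `encode_entry_as_indices` and `one_hot`, and the use of index 0 for unknown characters are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical alphabet, for illustration only; the plugin's alphabet may differ.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def encode_entry_as_indices(text, alphabet=ALPHABET):
    """Map each character to a 1-based integer index (0 means unknown character)."""
    lookup = {ch: i + 1 for i, ch in enumerate(alphabet)}
    return np.array([lookup.get(ch, 0) for ch in text], dtype=np.uint8)


def one_hot(indices, num_classes):
    """Expand integer indices into one-hot rows, as a network's first layer might."""
    out = np.zeros((indices.size, num_classes), dtype=np.float32)
    out[np.arange(indices.size), indices] = 1.0
    return out


indices = encode_entry_as_indices("hello world")
encoded = one_hot(indices, num_classes=len(ALPHABET) + 1)

# The stored dataset only needs `indices`; `encoded` is rebuilt on the fly.
print("bytes as indices:", indices.nbytes)
print("bytes as one-hot:", encoded.nbytes)
```

Because the indices are only used to select a one-hot column, the numeric distance between k and k+1 carries no meaning for the network, while the stored dataset stays at one small integer per character.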

@IsaacYangSLA
Contributor

Hi @gheinrich, thanks for the information about the first layer. I agree, that's a better design.

As for word2vec versus characters, more people seem to use word2vec in NLP applications, and the idea behind it is also reasonable, i.e. the concept that DC - USA + France ~= Paris. However, I see an increasing amount of research on character-based text processing. Maybe in a few years it will outperform word2vec in NLP applications.

@gheinrich gheinrich mentioned this pull request Sep 20, 2016
@gheinrich gheinrich force-pushed the dev/text-classification-workflow branch from edd781d to 74039af October 5, 2016 13:00
@gheinrich
Contributor Author

rebased on tip of master branch

author="Greg Heinrich",
description=("A data ingestion plugin for text classification"),
long_description=read('README'),
license="Apache",
Member

Why Apache and not BSD-3?

Contributor Author

I didn't really think of it - good point, I'll use the same license as the top-level setup.py

Contributor Author

done on latest commit

scores = output_data[output_data.keys()[0]].astype('float32')

if self.terminal_layer_type == "logsoftmax":
    scores = np.exp(scores)
Member

Could you just check to see if the values sum to 1 instead of having this form field?

Contributor Author

good idea (I guess I can also check if values are positive or negative)!
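
The two checks suggested in this thread (do the values sum to 1, and what is their sign) can be combined as in the sketch below. It is an illustration under stated assumptions, not the code that was merged: the function name `to_probabilities`, the tolerance `atol`, and the decision to raise on ambiguous input are all hypothetical.

```python
import numpy as np


def to_probabilities(scores, atol=1e-3):
    """Return class probabilities from softmax or log-softmax outputs.

    Hypothetical helper: guesses the terminal layer type from the values
    instead of asking the user for it.
    """
    scores = np.asarray(scores, dtype=np.float32)
    row_sums = scores.sum(axis=-1)

    # Softmax outputs are non-negative and each row sums to 1.
    if np.all(scores >= 0) and np.allclose(row_sums, 1.0, atol=atol):
        return scores

    # Log-softmax outputs are logs of probabilities, so none can exceed 0.
    if np.max(scores) <= 0:
        return np.exp(scores)

    raise ValueError("scores look like neither softmax nor log-softmax outputs")


logits = np.array([[1.0, 2.0, 3.0]])
softmax = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
log_softmax = np.log(softmax)

print(to_probabilities(softmax))
print(to_probabilities(log_softmax))
```

One caveat of any sign-based heuristic: raw logits that all happen to be negative would be mistaken for log-softmax outputs, so the check is only reliable when the terminal layer is known to be one of the two.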

Contributor Author

done on latest commit

@lukeyeager lukeyeager self-assigned this Oct 20, 2016
Member

@lukeyeager lukeyeager left a comment

I ran through the text classification example using these plugins and it worked great. I'd like to see the logsoftmax thing removed before merge since it will make this a bit easier to use.

@lukeyeager lukeyeager removed their assignment Oct 20, 2016
@gheinrich gheinrich force-pushed the dev/text-classification-workflow branch from 74039af to f8447b0 October 25, 2016 09:14
@gheinrich gheinrich force-pushed the dev/text-classification-workflow branch from f8447b0 to e7745fa October 25, 2016 09:32
@gheinrich
Contributor Author

I have updated the text classification example to show how to use the plug-ins


if np.max(scores) < 0:
    # terminal layer is a logsoftmax
    scores = np.exp(scores)
Member

👍

```sh
$ pip install $DIGITS_ROOT/plugins/data/textClassification
$ pip install $DIGITS_ROOT/plugins/view/textClassification
```
Member

👍

@lukeyeager lukeyeager merged commit 03ed004 into NVIDIA:master Oct 25, 2016
@gheinrich gheinrich deleted the dev/text-classification-workflow branch November 30, 2016 16:49
ethantang95 pushed a commit to ethantang95/DIGITS that referenced this pull request Jul 10, 2017
3 participants