Tokenizers don't split words into sub-words #5756
@tabergma Do we have any performance numbers with and without this fix?
Yes, I tested it on carbon bot and the results were about the same (77.2% vs 77.8% for entities - branch composite entities). I also verified locally on the smaller example bots that predictions are no longer made on sub-tokens.
Results on Sara (2-fold cross-validation):
Tested on carbon bot with BERT as well. Performance for entities improves by 2 points 🚀
Just one comment for an additional test.
Looks good! 🌟
Proposed changes:
To stop our entity extractors from predicting entity labels for only part of a word, we previously introduced a cleaning step that runs after prediction. However, we should avoid the incorrect prediction in the first place.
To achieve this, tokenizers no longer split words into sub-words. Instead, we take the mean of the sub-words' feature vectors as the feature vector of the word.
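The averaging described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code from this PR: the function name `mean_pool_sub_words`, its parameters, and the example vectors are all hypothetical, and the real featurizers work on dense arrays rather than nested lists.

```python
from typing import List, Sequence


def mean_pool_sub_words(
    sub_word_vectors: Sequence[Sequence[float]],
    sub_words_per_word: Sequence[int],
) -> List[List[float]]:
    """Collapse sub-word feature vectors into one vector per word.

    sub_word_vectors: one feature vector per sub-word, in sentence order.
    sub_words_per_word: how many sub-words each word was split into.
    (Hypothetical helper; names are assumptions, not the PR's API.)
    """
    if any(count < 1 for count in sub_words_per_word):
        raise ValueError("every word needs at least one sub-word")
    if sum(sub_words_per_word) != len(sub_word_vectors):
        raise ValueError("sub-word counts do not match the number of vectors")

    word_vectors = []
    start = 0
    for count in sub_words_per_word:
        group = sub_word_vectors[start:start + count]
        # Average each feature dimension across the word's sub-words.
        word_vectors.append([sum(dim) / count for dim in zip(*group)])
        start += count
    return word_vectors


# Illustrative values only: the first word maps to one sub-word,
# the second word was split into two sub-words.
vectors = [[1.0, 2.0], [2.0, 4.0], [4.0, 8.0]]
word_vectors = mean_pool_sub_words(vectors, [1, 2])
```

Because the output has exactly one vector per word, a downstream entity extractor sees one position per word and cannot assign a label to only part of it.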
fixes #5755
closes https://github.com/RasaHQ/research/issues/83
Status (please check what you already did):
black (please check Readme for instructions)