Tokenizers don't split words into sub-words #5756

tabergma · 2020-04-30T08:43:53Z

Proposed changes:
To prevent our entity extractors from predicting an entity label for only part of a word, we introduced a cleaning method that runs after prediction. However, we should avoid the incorrect prediction in the first place.
To achieve this, the tokenizers no longer split words into sub-words. Instead, we take the mean of the sub-word feature vectors as the feature vector of the whole word.
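
For illustration only, the mean-pooling idea can be sketched in plain Python as below. This is not the code from this PR: the function name `mean_pool_sub_tokens`, its parameters, and the toy vectors are hypothetical, and the sketch assumes the featurizer returns one vector per sub-word, in order, together with the number of sub-words each word was split into.

```python
from typing import List, Sequence


def mean_pool_sub_tokens(
    sub_token_vectors: Sequence[Sequence[float]],
    sub_tokens_per_word: Sequence[int],
) -> List[List[float]]:
    """Average consecutive sub-word vectors so that each word gets one vector.

    Hypothetical helper: `sub_tokens_per_word[i]` is assumed to be the number
    of sub-words that word i was split into by the language model tokenizer.
    """
    if any(count < 1 for count in sub_tokens_per_word):
        raise ValueError("every word needs at least one sub-word")
    if sum(sub_tokens_per_word) != len(sub_token_vectors):
        raise ValueError("sub-word counts do not match the number of vectors")

    word_vectors = []
    start = 0
    for count in sub_tokens_per_word:
        # Take the vectors that belong to the current word.
        group = sub_token_vectors[start:start + count]
        # Average dimension by dimension across the word's sub-words.
        word_vectors.append([sum(dim) / count for dim in zip(*group)])
        start += count
    return word_vectors


# Toy example: the first word maps to 1 sub-word, the second word to 2.
vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool_sub_tokens(vectors, [1, 2]))  # [[1.0, 2.0], [4.0, 5.0]]
```

Because every word ends up with exactly one feature vector, the entity extractor can only assign a label to a whole word, never to a fragment of it.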

fixes #5755
closes https://github.com/RasaHQ/research/issues/83

Status (please check what you already did):

added some tests for the functionality
updated the documentation
updated the changelog (please check changelog for instructions)
reformat files using black (please check Readme for instructions)

dakshvar22 · 2020-04-30T09:01:40Z

@tabergma Do we have any performance numbers with and without this fix?

tabergma · 2020-04-30T09:04:08Z

Yes, I tested it on carbon bot and the results were roughly the same (77.2% vs 77.8% for entities - branch composite entities). I also verified locally on the smaller example bots that the predictions are no longer made on sub-tokens.

tabergma · 2020-04-30T11:27:58Z

Results on Sara (2 fold cross validation):
master - micro f1: 83.8
fix-tokenization - micro f1: 85.5

dakshvar22

Tested on carbon bot with BERT as well. Performance for entities improves by 2 points 🚀
Just one comment for an additional test.

tests/utils/test_train_utils.py

dakshvar22

Looks good! 🌟

tabergma added 8 commits April 30, 2020 10:36

take mean vec of sub-tokens for ConveRT

260b282

take mean vec of sub-tokens for HF models

a1f1aaa

clean up doc strings

a23b279

remove unused method

5baa02e

remove unused method

b878ad6

update doc strings

a7bf4c1

fix tests

fe1b77f

add changelog

3588b5f

tabergma requested a review from dakshvar22 April 30, 2020 08:54

tabergma added 2 commits April 30, 2020 11:38

remove unused imports

25a3635

update test

562bff7

dakshvar22 requested changes May 4, 2020

tests/utils/test_train_utils.py (resolved)

add test for HFTransfomerNLP

3d6653c

dakshvar22 reviewed May 4, 2020

tests/utils/test_train_utils.py (outdated, resolved)

tabergma added 2 commits May 4, 2020 11:26

fix import

35b546d

update tests

9270738

tabergma changed the base branch from 1.10.x to master May 4, 2020 12:38

tabergma added 2 commits May 4, 2020 14:38

Merge branch 'master' into fix-tokenization

6ce0bea

model breaking: upate version

6a8f2b5

tabergma requested a review from dakshvar22 May 5, 2020 06:49

dakshvar22 approved these changes May 5, 2020

Merge branch 'master' into fix-tokenization

ae2ad82

tabergma merged commit 22f8f88 into master May 5, 2020

tabergma deleted the fix-tokenization branch May 5, 2020 08:30

tabergma mentioned this pull request Jun 9, 2020

DIET classifier _predict_entities function clean_up_entities for Chinese language issue #5972

Closed
