Tokenizer for ConveRT #4978

tabergma · 2019-12-16T08:49:14Z

Implement a tokenizer for ConveRT that lets us use ConveRT embeddings as a sequence (one embedding per token), for example in the CRFEntityExtractor.

Problem: ConveRT tokenizes words into subwords and adds special characters. As a result, the token start and end positions do not match the entity boundaries. We need an alignment step so that the tokens from ConveRT map back onto the entities.
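To make the alignment problem concrete, here is a minimal sketch of one possible approach: tokenize on whitespace first, split each word into subwords, and give every subword character offsets inside its parent word. This is not the actual ConveRT tokenizer or the implementation that was merged. `toy_subwords`, the `_` word-start marker, and all function names are hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, List, NamedTuple


class Token(NamedTuple):
    text: str
    start: int  # character offset of the first character in the original text
    end: int    # character offset one past the last character


def whitespace_tokens(text: str) -> List[Token]:
    """Split on whitespace and record each word's character offsets."""
    tokens, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append(Token(word, start, end))
        pos = end
    return tokens


def toy_subwords(word: str) -> List[str]:
    """Hypothetical subword splitter (NOT ConveRT): chunks of up to four
    characters, with an assumed "_" marker added to the first chunk."""
    pieces = [word[i:i + 4] for i in range(0, len(word), 4)]
    pieces[0] = "_" + pieces[0]
    return pieces


def align_subwords(
    text: str, split_word: Callable[[str], List[str]], marker: str = "_"
) -> List[Token]:
    """Return subword tokens whose offsets point into the original text."""
    aligned = []
    for word in whitespace_tokens(text):
        offset = word.start
        for i, piece in enumerate(split_word(word.text)):
            # Strip the special marker from the first piece only, so that the
            # remaining surface form matches the original characters.
            if i == 0 and piece.startswith(marker):
                piece = piece[len(marker):]
            aligned.append(Token(piece, offset, offset + len(piece)))
            offset += len(piece)
    return aligned


def tokens_in_span(tokens: List[Token], start: int, end: int) -> List[Token]:
    """Select the subword tokens that lie fully inside an entity span."""
    return [t for t in tokens if t.start >= start and t.end <= end]


text = "flying to Berlin"
aligned = align_subwords(text, toy_subwords)
# The entity "Berlin" covers characters 10..16 of the text.
berlin_tokens = tokens_in_span(aligned, 10, 16)
```

With offsets computed this way, every subword can be traced back to the entity annotation that covers it, which is the property a sequence model such as the CRFEntityExtractor needs. A real subword tokenizer may normalize or drop characters, in which case the simple running offset used here would need extra handling.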


tabergma added the labels type:enhancement ✨ (additions of new features or changes to existing ones, should be doable in a single PR) and area:rasa-oss 🎡 (anything related to the open source Rasa framework) on Dec 16, 2019

tabergma self-assigned this Dec 16, 2019

tabergma mentioned this issue Dec 17, 2019

Add ConveRTTokenizer #4984

Closed

4 tasks

tabergma mentioned this issue Jan 3, 2020

ConvertTokenizer, CLS token & features as sequence #4996

Merged

4 tasks

tabergma closed this as completed in #4996 Jan 13, 2020

JulianGerhard21 mentioned this issue Jan 30, 2020

Loss of confidence in Rasa > 1.6.0 nlu (compared to 1.4.6) #5004

Closed
