EntityExtractor in v1.10.0 yields wrong entity value for languages without spaces #5755
Labels
area:rasa-oss 🎡
Anything related to the open source Rasa framework
type:enhancement ✨
Additions of new features or changes to existing ones, should be doable in a single PR
Rasa version: v1.10.0
Python version: 3.7
Operating system: Windows, Linux
I'm not sure if this should be a bug or feature request, but here I go.
Description of Problem:
For languages that don't use spaces to separate words, any `EntityExtractor` class returns the whole sentence as the entity `value`. Looking at the changes made to that class, I see a few new functions that together make up the `clean_up_entities` function. The problem comes from that function, as I observed while following the process in the `SpacyEntityExtractor`. Based on the comment on the function `_token_clusters`: "two tokens belong to the same word if there is no other character between them", I assume that the whole process of the `clean_up_entities` function will merge the entire input if there are no spaces between words.

Overview of the Solution:
I haven't really looked into the sub-word problem that requires this `clean_up_entities` function, so I can't really offer a solution other than continuing to use v1.9.x. However, if this function is kept as it is in future releases, I either need to redo all of my training data to include spaces and preprocess the input to the Rasa chatbot with custom components, or stay on version 1.9.x.

Examples
On previous version:
On version 1.10.0:
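For illustration, the merging rule quoted above from `_token_clusters` can be sketched in plain Python. This is a minimal sketch, not Rasa's actual implementation and not output captured from Rasa: `Token`, `tokenize`, `token_clusters`, and `expand_entity` are hypothetical names, and the Japanese sample sentence and its token boundaries are assumed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    text: str
    start: int

    @property
    def end(self) -> int:
        return self.start + len(self.text)


def tokenize(text: str, words: List[str]) -> List[Token]:
    """Hypothetical helper: locate the given words in order and record offsets."""
    tokens, pos = [], 0
    for word in words:
        pos = text.index(word, pos)
        tokens.append(Token(word, pos))
        pos += len(word)
    return tokens


def token_clusters(tokens: List[Token]) -> List[List[Token]]:
    """Group tokens that have no character between them into one cluster."""
    clusters: List[List[Token]] = []
    for token in tokens:
        if clusters and clusters[-1][-1].end == token.start:
            clusters[-1].append(token)
        else:
            clusters.append([token])
    return clusters


def expand_entity(text: str, tokens: List[Token], start: int, end: int) -> str:
    """Widen the entity span [start, end) to cover every cluster it overlaps."""
    for cluster in token_clusters(tokens):
        c_start, c_end = cluster[0].start, cluster[-1].end
        if start < c_end and end > c_start:
            start, end = min(start, c_start), max(end, c_end)
    return text[start:end]


# Space-separated text: each word is its own cluster, so the entity is unchanged.
english = "I live in Tokyo"
english_tokens = tokenize(english, ["I", "live", "in", "Tokyo"])
print(expand_entity(english, english_tokens, 10, 15))  # Tokyo

# Text without spaces (assumed tokenization): all tokens touch, so they form
# a single cluster and the entity grows to the whole sentence.
japanese = "私は東京に住んでいます"
japanese_tokens = tokenize(japanese, ["私", "は", "東京", "に", "住んで", "います"])
print(expand_entity(japanese, japanese_tokens, 2, 4))  # 私は東京に住んでいます
```

Under these assumptions, the entity `東京` (characters 2 to 4) is expanded to the full input, which matches the behavior described in this issue: when no character separates any two tokens, a rule that joins touching tokens cannot tell a sub-word from a separate word.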