Unable to recognize multiple entities of same type in a sentence without any separation symbol (or a single space) #6340

Closed
praneethgb opened this issue Aug 4, 2020 · 10 comments
Labels
area:rasa-oss 🎡 Anything related to the open source Rasa framework type:bug 🐛 Inconsistencies or issues which will cause an issue or problem for users or implementors.

@praneethgb
Contributor

praneethgb commented Aug 4, 2020

test file for reference: https://github.com/RasaHQ/rasa/blob/2b12852ae04aa2d9de6bacdc5b44d1894295fb27/tests/nlu/extractors/test_extractor.py

(
    "Amsterdam Berlin and London",
    {
        "entity": ["city", "city", "O", "city"],
        "role": ["O", "O", "O", "O"],
        "group": ["O", "O", "O", "O"],
    },
    None,
    [
        {"entity": "city", "start": 0, "end": 16, "value": "Amsterdam Berlin"},
        {"entity": "city", "start": 21, "end": 27, "value": "London"},
    ],
),

The expected output should be:
{"entity": "city", "start": 0, "end": 9, "value": "Amsterdam"},
{"entity": "city", "start": 10, "end": 16, "value": "Berlin"},
{"entity": "city", "start": 21, "end": 27, "value": "London"}

Because Amsterdam (U-city) and Berlin (U-city) are different city entities.
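
The character offsets for each token can be checked with a short sketch. It assumes whitespace tokenization and end-exclusive offsets (consistent with the 0 to 16 and 21 to 27 spans in the test case above); `token_offsets` is a hypothetical helper, not part of Rasa.

```python
import re
from typing import List, Tuple


def token_offsets(text: str) -> List[Tuple[str, int, int]]:
    """Return (token, start, end) for each whitespace-separated token.

    Offsets are end-exclusive, so text[start:end] == token.
    """
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


for token, start, end in token_offsets("Amsterdam Berlin and London"):
    print(token, start, end)
```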

@praneethgb
Contributor Author

praneethgb commented Aug 4, 2020

Hi @tabergma, @tmbo

Would you be able to provide your input on this issue?

@praneethgb praneethgb changed the title Unable to recognize multiple entities of same type in a sentence without any separation symbol Unable to recognize multiple entities of same type in a sentence without any separation symbol (or a single space) Aug 4, 2020
@sara-tagger
Collaborator

Thanks for the issue, @tttthomasssss will get back to you about it soon!

You may find help in the docs and the forum, too 🤗

@tabergma
Contributor

tabergma commented Aug 5, 2020

@praneethgb This is expected behaviour. For more explanation see this PR and the related forum post.

@praneethgb
Contributor Author

praneethgb commented Aug 5, 2020

Hi @tabergma,

"Amsterdam Berlin" is not a city name, so it should not be returned as a single entity.

For example, consider this use case: "my ingredients are eggs (ingredients) lemon juice (ingredients) and milk (ingredients)".
The ASR output was "my ingredients are eggs lemon juice and milk."

When the DIET model is trained for NER, it is trained to recognize them as eggs: U-ingredients, lemon: B-ingredients, juice: L-ingredients, milk: U-ingredients.
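
As an illustration of the tagging described above, here is a minimal sketch that encodes token-level entity spans as BILOU tags. The helper `bilou_tags` and its span format are assumptions made for this example; they are not Rasa's API.

```python
from typing import List, Tuple


def bilou_tags(num_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Encode token-level entity spans as BILOU tags.

    Each span is (first_token_index, last_token_index, label), inclusive.
    """
    tags = ["O"] * num_tokens
    for first, last, label in spans:
        if first == last:
            tags[first] = f"U-{label}"  # single-token (unit) entity
            continue
        tags[first] = f"B-{label}"      # beginning of a multi-token entity
        for i in range(first + 1, last):
            tags[i] = f"I-{label}"      # inside
        tags[last] = f"L-{label}"       # last token
    return tags


tokens = "my ingredients are eggs lemon juice and milk".split()
# eggs -> token 3, lemon juice -> tokens 4-5, milk -> token 7
spans = [(3, 3, "ingredients"), (4, 5, "ingredients"), (7, 7, "ingredients")]
for token, tag in zip(tokens, bilou_tags(len(tokens), spans)):
    print(f"{token}: {tag}")
```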

In post-processing, the expected result is three ingredients: eggs, lemon juice, and milk. Instead, eggs and lemon juice are merged into one entity.

DIET: removing BILOU tags at https://github.com/RasaHQ/rasa/blob/master/rasa/nlu/classifiers/diet_classifier.py#L938

@tabergma
Contributor

@praneethgb Not sure I understand what you are suggesting. I know that "Amsterdam Berlin" is not a city and it is not ideal that we capture it as one entity. The problem is that if we don't merge entities with the same tag when they appear right next to each other, we would not be able to detect "San Francisco", for example; it would always be detected as "San" and "Francisco", two independent entities, which is also not ideal.
What is your idea to support both cases? Please keep in mind that not all users are using BILOU tagging.
Also, if you add a comma in between your ingredients, everything should be extracted as expected. E.g. "my ingredients are eggs, lemon juice, and milk".
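
The trade-off described here can be seen in a simplified sketch of merging adjacent tokens that share a label. This is not Rasa's actual implementation; `merge_adjacent_same_label` is a hypothetical helper, and the labels are assumed to carry no BILOU prefix.

```python
from typing import Dict, List


def merge_adjacent_same_label(
    tokens: List[str], labels: List[str]
) -> List[Dict[str, str]]:
    """Join consecutive tokens that carry the same non-"O" label."""
    entities: List[Dict[str, str]] = []
    previous = "O"
    for token, label in zip(tokens, labels):
        if label != "O":
            if label == previous:
                entities[-1]["value"] += " " + token  # extend the open entity
            else:
                entities.append({"entity": label, "value": token})
        previous = label
    return entities


# Merging is what makes a two-token city come out as one entity ...
print(merge_adjacent_same_label(["San", "Francisco"], ["city", "city"]))
# ... but it also glues two neighbouring cities together.
print(merge_adjacent_same_label(
    ["Amsterdam", "Berlin", "and", "London"], ["city", "city", "O", "city"]
))
```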

@AMR-KELEG
Contributor

I am starting to run into similar issues with some queries, so I am sharing my thoughts (though I am not very familiar with how BILOU tagging is used in DIET or the NLU data).

> @praneethgb Not sure I understand what you are suggesting. I know that "Amsterdam Berlin" is not a city and it is not ideal that we capture it as one entity. The problem is that if we don't merge entities with the same tag when they appear right next to each other, we would not be able to detect "San Francisco", for example; it would always be detected as "San" and "Francisco", two independent entities, which is also not ideal.

This might be a bit optimistic but this is how I think the model should behave:

| Input | Correct/expected prediction | Processed prediction (merging BILOU tags) |
| --- | --- | --- |
| Amsterdam Berlin | [Amsterdam](U-city) [Berlin](U-city) | [Amsterdam](city) [Berlin](city) |
| San Francisco | [San](B-city) [Francisco](L-city) | [San Francisco](city) |
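
The mapping in the table could be sketched roughly as follows. This is only an illustration under stated assumptions: `merge_bilou` is a hypothetical helper (not the code in `diet_classifier.py`), and a stray `I-` or `L-` tag with no matching open entity is treated as the start of a new entity.

```python
from typing import Dict, List, Optional


def merge_bilou(tokens: List[str], tags: List[str]) -> List[Dict[str, str]]:
    """Group tokens into entities using the BILOU prefix of each tag."""
    entities: List[Dict[str, str]] = []
    open_label: Optional[str] = None  # label of the entity still being built
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix in ("I", "L") and label == open_label:
            entities[-1]["value"] += " " + token  # continue the open entity
            open_label = None if prefix == "L" else label
        elif prefix in ("B", "U", "I", "L"):
            # B/U always start a new entity; a stray I/L is handled the
            # same way (an assumption made for this sketch).
            entities.append({"entity": label, "value": token})
            open_label = label if prefix in ("B", "I") else None
        else:
            open_label = None  # "O" closes any open entity
    return entities


print(merge_bilou(["Amsterdam", "Berlin"], ["U-city", "U-city"]))
print(merge_bilou(["San", "Francisco"], ["B-city", "L-city"]))
```

Because `U-` always opens a new entity, two adjacent single-token cities stay separate, while a `B-`/`L-` pair is still joined into one value.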

> What is your idea to support both cases? Please keep in mind that not all users are using BILOU tagging.

Yes, I still get confused by BILOU tagging myself, but a mapping like the one shown above would be convenient for users.

> Also, if you add a comma in between your ingredients, everything should be extracted as expected. E.g. "my ingredients are eggs, lemon juice, and milk".

We (as developers, engineers, and researchers) can have our own set of guidelines, but it is sometimes frustrating for the users who interact with the chatbot to have to follow a certain format. It would be better if we could support more ways of writing queries, e.g. both "eggs lemon juice and milk" and "eggs, lemon juice and milk", as long as doing so won't hurt the model's performance.

@tabergma
Contributor

Yeah, I think we can update this for BILOU tagging, but I guess it will not be possible if the model is trained without BILOU tagging. @praneethgb or @AMR-KELEG, would either of you be willing to create a PR for this?
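
A small sketch of why the non-BILOU case cannot be separated in post-processing: once the prefixes are gone, two adjacent single-token cities and one two-token city yield the same label sequence. `strip_bilou` is a hypothetical helper used only for this illustration.

```python
from typing import List


def strip_bilou(tags: List[str]) -> List[str]:
    """Drop the B-/I-/L-/U- prefix, keeping only the entity label."""
    return [tag.split("-", 1)[1] if tag != "O" else "O" for tag in tags]


two_cities = strip_bilou(["U-city", "U-city"])  # Amsterdam Berlin
one_city = strip_bilou(["B-city", "L-city"])    # San Francisco
print(two_cities, one_city, two_cities == one_city)
```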

@AMR-KELEG
Contributor

> Yeah, I think we can update this for BILOU tagging, but I guess it will not be possible if the model is trained without BILOU tagging. @praneethgb or @AMR-KELEG, would either of you be willing to create a PR for this?

I will need to have a look first but yes I am willing to work on it.

@praneethgb
Contributor Author

> Also, if you add a comma in between your ingredients, everything should be extracted as expected. E.g. "my ingredients are eggs, lemon juice, and milk".

Also, if we use voice input, the output of Automatic Speech Recognition models will not contain commas at all.

> Yeah, I think we can update this for BILOU tagging, but I guess it will not be possible in case the model is trained without BILOU tagging

Yes.

@praneethgb praneethgb mentioned this issue Aug 14, 2020
@praneethgb
Contributor Author

Hi @tabergma,

I've created PR: #6423 to support this use case.
