Entity synonym #151
Conversation
self.extractor = None
self.classifier = None
if entity_extractor:
    self.extractor = named_entity_extractor(entity_extractor, feature_extractor)
if intent_classifier:
    self.classifier = text_categorizer(intent_classifier, feature_extractor)
self.tokenizer = MITIETokenizer()
if entity_synonyms:
    self.entity_synonyms = json.loads(codecs.open(entity_synonyms, encoding='utf-8').read())
unclosed file handle
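A context manager closes the handle deterministically, even if parsing raises. A minimal sketch of the fix (the helper name is illustrative, not from the PR):

```python
import io
import json

def load_entity_synonyms(entity_synonyms_path):
    # 'with' guarantees the file handle is closed, even if json.loads raises
    with io.open(entity_synonyms_path, encoding='utf-8') as f:
        return json.loads(f.read())
```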
@@ -25,9 +34,12 @@ def get_entities(self, text):
    expr = re.compile(_regex)
    m = expr.search(text)
    start, end = m.start(), m.end()
    entity_value = text[start:end]
    if entity_value in self.entity_synonyms:
What's with the different casing here? E.g. would using entity_value.lower() make sense?
Yes, I think it makes sense, but we need to be careful when the token is an integer: in that case entity_value.lower() would fail.
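The caveat about integer tokens can be handled with a type guard. A Python 3 sketch (the PR's Python 2 code would check unicode instead of str; the helper name is hypothetical):

```python
def normalize(entity_value):
    # Only strings have .lower(); integers and other values pass through unchanged.
    if isinstance(entity_value, str):
        return entity_value.lower()
    return entity_value
```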
self.extractor = named_entity_extractor(metadata["entity_extractor"])  # , metadata["feature_extractor"])
self.classifier = text_categorizer(metadata["intent_classifier"])  # , metadata["feature_extractor"])
self.tokenizer = MITIETokenizer()
if entity_synonyms:
    self.entity_synonyms = json.loads(codecs.open(entity_synonyms, encoding='utf-8').read())
Since this duplicates the code for every type of backend, how about moving the synonym handling to a separate component and inherit / import the functionality?
It's a good idea. Apart from separating the loading phase, we could also avoid replicating the step in which we replace the entities, by adding a replace_synonyms method that takes a dictionary; the method could be static and defined on the Interpreter class that every interpreter extends.
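A sketch of the proposed refactoring, assuming the synonym map is keyed by lowercased surface form (names follow the discussion, but this is not code from the PR):

```python
class Interpreter(object):
    def parse(self, text):
        raise NotImplementedError()

    @staticmethod
    def replace_synonyms(entities, entity_synonyms):
        # Map each extracted value to its base form, if a synonym is known.
        for entity in entities:
            value = entity["value"]
            if isinstance(value, str) and value.lower() in entity_synonyms:
                entity["value"] = entity_synonyms[value.lower()]
        return entities


class MITIEInterpreter(Interpreter):
    # Each backend interpreter inherits replace_synonyms instead of
    # duplicating the lookup logic.
    pass
```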
self.intent_examples.append({"text": text, "intent": intent})
self.entity_examples.append({"text": text, "intent": intent, "entities": entities})

# create synonyms dictionary
separate into function
I don't quite agree with this. Technically we are parsing all the files within a directory here, and some of them are used to create the synonyms (which are part of the loading_data step). Alternatively we could put this inside utils, but I am not really sure why we would do that.
I don't have a strong opinion on this one; I am just worried about the length of this function.
src/__init__.py
Outdated
entity_value = entities[i]["value"]
if (type(entity_value) == unicode and type(entity_synonyms) == unicode and
        entity_value.lower() in entity_synonyms):
    entities[i]["value"] = entity_synonyms[entity_value]
why not entity_value.lower()?
Sometimes the entity can be an integer, which doesn't have a .lower method, so we should check for that.
I do understand the if, but the access into the dict should be entity_synonyms[entity_value.lower()], otherwise it might fail: entity_value.lower() in entity_synonyms might be true in your if statement while entity_value in entity_synonyms is false at the same time.
Concerning entities with numbers as value:
- is there a reason why we need to support non string values for entities? @amn41
- if we do support it, do we have an example for that in the test data? there should definitely be a test for that (esp. with this synonym logic)
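The mismatch between the guard and the lookup can be demonstrated in a few lines (the synonym entries are illustrative):

```python
entity_synonyms = {"nyc": "New York City"}
entity_value = "NYC"

# The guard passes ...
assert entity_value.lower() in entity_synonyms
# ... but indexing with the unlowered value raises KeyError:
try:
    entity_synonyms[entity_value]
except KeyError:
    pass  # exactly the bug: guard and lookup use different keys

# Consistent fix: index with the same lowercased key used in the guard.
assert entity_synonyms[entity_value.lower()] == "New York City"
```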
I can't think of a reason why we should support entities other than strings. We should really add a check for this in the validate method of the TrainingData class.
Technically our entities could all be strings; even if the user marks something like this as an entity:

{
    "start": 10,
    "end": 11,
    "value": "3",
    "entity": "fromPrice"
}

I think that the tokenizer itself could produce entities that are non-unicode. So we should definitely add a check into the validate method, but I don't think you can avoid the unicode check unless you iterate over the entities and "unicodify" (sorry for the horrible term) all the entity values.
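A possible shape for the suggested check, written as a Python 3 sketch; the attribute and method names follow the discussion (validate, entity_examples, TrainingData), but the details are assumptions, not the PR's actual code:

```python
class TrainingData(object):
    def __init__(self, entity_examples):
        self.entity_examples = entity_examples
        self.validate()

    def validate(self):
        # Reject non-string entity values early, so downstream synonym
        # lookups can rely on .lower() being available.
        for example in self.entity_examples:
            for entity in example.get("entities", []):
                if not isinstance(entity["value"], str):
                    raise ValueError(
                        "entity value {!r} is not a string".format(entity["value"]))
```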
Wait, how could the tokenizer produce anything other than a list of unicode objects?
src/__init__.py
Outdated
__version__ = get_distribution('rasa_nlu').version


class Interpreter(object):
    def parse(self, text):
        raise NotImplementedError()

    @staticmethod
    def load_synonyms(entity_synonyms):
entity_synonyms_file, maybe? In the next function it's named the same way but represents a map, whereas this is supposed to be a path.
Yes! That's definitely a better name
src/__init__.py
Outdated
def replace_synonyms(entities, entity_synonyms):
    for i in range(len(entities)):
        entity_value = entities[i]["value"]
        if (type(entity_value) == unicode and type(entity_synonyms) == unicode and
I am not sure I understand this: why is this logic so different from the one used when creating the dict in https://github.com/golastmile/rasa_nlu/pull/151/files#diff-a3ee8ed9ecce93718fad630ad21f9376R32m ?
This somewhat also belongs to your comment 5 lines below.
I don't understand why at this location we only replace values where the entity value is a string, but in the referenced collection of synonyms we also collect synonyms for non-string entities.
Addresses #106
Now we should be able to generate synonyms from the training data. If any entity is referenced by one or more synonyms, we create a new file named index.json where we save the mapping between a word and its basic form. For an example of the mapping and a description of the solution, please see the attached issue.
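To illustrate the mapping, a small sketch that writes an index.json of the described shape (the synonym entries are invented for illustration; the authoritative format is in the linked issue):

```python
import io
import json

# Illustrative synonym map: each surface form points to its basic form.
entity_synonyms = {
    "nyc": "New York City",
    "new york": "New York City",
}

with io.open("index.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(entity_synonyms, ensure_ascii=False))
```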