Entity synonym #151
Conversation
self.extractor = None
self.classifier = None
if entity_extractor:
    self.extractor = named_entity_extractor(entity_extractor, feature_extractor)
if intent_classifier:
    self.classifier = text_categorizer(intent_classifier, feature_extractor)
self.tokenizer = MITIETokenizer()
if entity_synonyms:
    self.entity_synonyms = json.loads(codecs.open(entity_synonyms, encoding='utf-8').read())
unclosed file handle
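A context manager closes the handle deterministically, even if parsing raises. A minimal sketch of the fix (the helper name is illustrative, not from the PR):

```python
import io
import json

def load_entity_synonyms(entity_synonyms_path):
    # 'with' guarantees the file handle is closed, even if json.loads raises
    with io.open(entity_synonyms_path, encoding='utf-8') as f:
        return json.loads(f.read())
```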
@@ -25,9 +34,12 @@ def get_entities(self, text):
    expr = re.compile(_regex)
    m = expr.search(text)
    start, end = m.start(), m.end()
    entity_value = text[start:end]
    if entity_value in self.entity_synonyms:
What's with the different casing here? E.g. would using entity_value.lower() make sense?
Yes, I think it makes sense, but we need to be careful when the token is an integer: in that case entity_value.lower() would fail.
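The caveat about integer tokens can be handled with a type guard. A Python 3 sketch (the PR's Python 2 code would check unicode instead of str; the helper name is hypothetical):

```python
def normalize(entity_value):
    # Only strings have .lower(); integers and other values pass through unchanged.
    if isinstance(entity_value, str):
        return entity_value.lower()
    return entity_value
```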
self.extractor = named_entity_extractor(metadata["entity_extractor"])  # , metadata["feature_extractor"])
self.classifier = text_categorizer(metadata["intent_classifier"])  # , metadata["feature_extractor"])
self.tokenizer = MITIETokenizer()
if entity_synonyms:
    self.entity_synonyms = json.loads(codecs.open(entity_synonyms, encoding='utf-8').read())
Since this duplicates the code for every type of backend, how about moving the synonym handling to a separate component and inherit / import the functionality?
It's a good idea. Apart from separating the loading phase, we could also avoid replicating the step in which we replace the entities, by adding a replace_synonyms method that takes a dictionary; the method could be static and defined on the Interpreter class that every interpreter extends.
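A sketch of the proposed refactoring, assuming the synonym map is keyed by lowercased surface form (names follow the discussion, but this is not code from the PR):

```python
class Interpreter(object):
    def parse(self, text):
        raise NotImplementedError()

    @staticmethod
    def replace_synonyms(entities, entity_synonyms):
        # Map each extracted value to its base form, if a synonym is known.
        for entity in entities:
            value = entity["value"]
            if isinstance(value, str) and value.lower() in entity_synonyms:
                entity["value"] = entity_synonyms[value.lower()]
        return entities


class MITIEInterpreter(Interpreter):
    # Each backend interpreter inherits replace_synonyms instead of
    # duplicating the lookup logic.
    pass
```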
self.intent_examples.append({"text": text, "intent": intent})
self.entity_examples.append({"text": text, "intent": intent, "entities": entities})

# create synonyms dictionary
separate into function
I don't quite agree with this. Technically we are parsing all the files within a directory here, and some of them are used to create the synonyms (which are part of the loading_data step). Alternatively we could put this inside utils, but I am not really sure why we would do that.
I don't have a strong opinion on this one; I am just worried about the length of this function.
src/__init__.py
Outdated
entity_value = entities[i]["value"]
if (type(entity_value) == unicode and type(entity_synonyms) == unicode and
        entity_value.lower() in entity_synonyms):
    entities[i]["value"] = entity_synonyms[entity_value]
why not entity_value.lower()?
Sometimes the entity can be an integer, which doesn't have a .lower method, so we should check for that.
I do understand the if, but the access into the dict should be entity_synonyms[entity_value.lower()], otherwise it might fail: entity_value.lower() in entity_synonyms might be true in your if statement while entity_value in entity_synonyms is false at the same time.
Concerning entities with numbers as value:
- is there a reason why we need to support non string values for entities? @amn41
- if we do support it, do we have an example for that in the test data? there should definitely be a test for that (esp. with this synonym logic)
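The mismatch between the guard and the lookup can be demonstrated in a few lines (the synonym entries are illustrative):

```python
entity_synonyms = {"nyc": "New York City"}
entity_value = "NYC"

# The guard passes ...
assert entity_value.lower() in entity_synonyms
# ... but indexing with the unlowered value raises KeyError:
try:
    entity_synonyms[entity_value]
except KeyError:
    pass  # exactly the bug: guard and lookup use different keys

# Consistent fix: index with the same lowercased key used in the guard.
assert entity_synonyms[entity_value.lower()] == "New York City"
```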
I can't think of a reason why we should support entities other than strings. We should really add a check for this in the validate method of the TrainingData class.
Technically our entities could all be strings; even if the user marks something like this as an entity:

{
    "start": 10,
    "end": 11,
    "value": "3",
    "entity": "fromPrice"
}

I think that the tokenizer itself could produce entities that are non-unicode. So we should definitely add a check into the validate method, but I don't think you can avoid the unicode check unless you iterate over the entities and "unicodify" (sorry for the horrible term) all the entity values.
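A possible shape for the suggested check, written as a Python 3 sketch; the attribute and method names follow the discussion (validate, entity_examples, TrainingData), but the details are assumptions, not the PR's actual code:

```python
class TrainingData(object):
    def __init__(self, entity_examples):
        self.entity_examples = entity_examples
        self.validate()

    def validate(self):
        # Reject non-string entity values early, so downstream synonym
        # lookups can rely on .lower() being available.
        for example in self.entity_examples:
            for entity in example.get("entities", []):
                if not isinstance(entity["value"], str):
                    raise ValueError(
                        "entity value {!r} is not a string".format(entity["value"]))
```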
Wait, how could the tokenizer produce anything other than a list of unicode objects?
src/__init__.py
Outdated
__version__ = get_distribution('rasa_nlu').version


class Interpreter(object):
    def parse(self, text):
        raise NotImplementedError()

    @staticmethod
    def load_synonyms(entity_synonyms):
entity_synonyms_file, maybe? In the next function it's named the same way but represents a map, whereas this is supposed to be a path.
Yes! That's definitely a better name
src/__init__.py
Outdated
def replace_synonyms(entities, entity_synonyms):
    for i in range(len(entities)):
        entity_value = entities[i]["value"]
        if (type(entity_value) == unicode and type(entity_synonyms) == unicode and
I am not sure I understand this: why is this logic so different from the one used when creating the dict in https://github.com/golastmile/rasa_nlu/pull/151/files#diff-a3ee8ed9ecce93718fad630ad21f9376R32m ?
This somewhat also belongs to your comment 5 lines below.
I don't understand why at this location we only replace values where the entity value is a string, but in the referenced collection of synonyms we also collect synonyms for non-string entities.
Addresses #106
Now we should be able to generate synonyms from the training data. If any entity is referenced by one or more synonyms, we create a new file named index.json where we save the mapping between a word and its basic form. For an example of the mapping and a description of the solution, please see the attached issue.
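To illustrate the mapping, a small sketch that writes an index.json of the described shape (the synonym entries are invented for illustration; the authoritative format is in the linked issue):

```python
import io
import json

# Illustrative synonym map: each surface form points to its basic form.
entity_synonyms = {
    "nyc": "New York City",
    "new york": "New York City",
}

with io.open("index.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(entity_synonyms, ensure_ascii=False))
```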