
Entity synonym #151

Merged: plauto merged 12 commits from entity-synonym into master on Feb 22, 2017
Conversation

plauto (Contributor) commented Feb 12, 2017

Addresses #106

Now we should be able to generate synonyms from the training data. If any entity is referenced by one or more synonyms, we create a new file named index.json where we save the mapping between a word and its basic form.

For an example of the mapping and a description of the solution, please see the linked issue.
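
For illustration only, here is a rough sketch of how such a mapping could be built from the entity examples and written to index.json; the helper name and the "literal text differs from value" rule are assumptions about the approach, not the PR's exact code:

    # Sketch, not the PR's code: whenever the literal text span of a marked entity
    # differs from its canonical "value", remember span -> value so lookups can
    # later map a synonym back to its basic form.
    import codecs
    import json

    def build_and_save_synonyms(entity_examples, path="index.json"):
        entity_synonyms = {}
        for example in entity_examples:
            text = example["text"]
            for entity in example.get("entities", []):
                literal = text[entity["start"]:entity["end"]]
                if literal != entity["value"]:
                    entity_synonyms[literal] = entity["value"]
        with codecs.open(path, "w", encoding="utf-8") as f:
            f.write(json.dumps(entity_synonyms, ensure_ascii=False))
        return entity_synonyms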

self.extractor = None
self.classifier = None
if entity_extractor:
    self.extractor = named_entity_extractor(entity_extractor, feature_extractor)
if intent_classifier:
    self.classifier = text_categorizer(intent_classifier, feature_extractor)
self.tokenizer = MITIETokenizer()
if entity_synonyms:
    self.entity_synonyms = json.loads(codecs.open(entity_synonyms, encoding='utf-8').read())
Member

unclosed file handle
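
A context manager would take care of this; a minimal sketch of the fix for the line quoted above (not necessarily the committed change):

    # Let a with-block close the synonym file instead of leaking the handle.
    with codecs.open(entity_synonyms, encoding='utf-8') as f:
        self.entity_synonyms = json.loads(f.read())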

@@ -25,9 +34,12 @@ def get_entities(self, text):
expr = re.compile(_regex)
m = expr.search(text)
start, end = m.start(), m.end()
entity_value = text[start:end]
if entity_value in self.entity_synonyms:
Member

What's with different casing, e.g. would using entity_value.lower() make sense?

plauto (Contributor, Author) commented Feb 13, 2017

Yes, I think it makes sense, but we need to be careful when the token is an integer; in that case calling entity_value.lower() would be a problem.
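
One way to cover both concerns, assuming values can indeed be non-strings, is to normalise only when the value is text; a sketch (using Python 2's basestring, to match the code under review):

    # Hypothetical guard: only lower-case values that actually are text.
    entity_value = entities[i]["value"]
    if isinstance(entity_value, basestring) and entity_value.lower() in entity_synonyms:
        entities[i]["value"] = entity_synonyms[entity_value.lower()]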

self.extractor = named_entity_extractor(metadata["entity_extractor"]) # ,metadata["feature_extractor"])
self.classifier = text_categorizer(metadata["intent_classifier"]) # ,metadata["feature_extractor"])
self.tokenizer = MITIETokenizer()
if entity_synonyms:
    self.entity_synonyms = json.loads(codecs.open(entity_synonyms, encoding='utf-8').read())
Member

Since this duplicates the code for every type of backend, how about moving the synonym handling to a separate component and inherit / import the functionality?

plauto (Contributor, Author) commented Feb 13, 2017

It's a good idea. Apart from separating out the loading phase, we could also avoid replicating the step in which we replace the entities, by adding a replace_synonyms method that takes a dictionary; the method could be static and defined on the Interpreter class that every interpreter extends.
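
Roughly, the shared piece could look like this (a sketch of the proposal, not the PR's final code; the method bodies are assumed):

    import codecs
    import json

    class Interpreter(object):
        """Common base class; every backend interpreter inherits the synonym handling."""

        @staticmethod
        def load_synonyms(entity_synonyms_file):
            # empty mapping when training produced no synonym file
            if not entity_synonyms_file:
                return {}
            with codecs.open(entity_synonyms_file, encoding='utf-8') as f:
                return json.loads(f.read())

        @staticmethod
        def replace_synonyms(entities, entity_synonyms):
            # rewrite each entity value to its basic form, if one is known
            for entity in entities:
                value = unicode(entity["value"])
                if value.lower() in entity_synonyms:
                    entity["value"] = entity_synonyms[value.lower()]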

self.intent_examples.append({"text": text, "intent": intent})
self.entity_examples.append({"text": text, "intent": intent, "entities": entities})

# create synonyms dictionary
Member

separate into function

plauto (Contributor, Author) commented Feb 13, 2017

I don't quite agree with this. Technically we are parsing all the files within a directory here, and some of them are used to create the synonyms (which is part of the load-data step). Alternatively we could put this inside utils, but I'm not really sure why we would do that.

Member

I don't have a strong opinion on this one; I'm just worried about the length of this function.

src/__init__.py Outdated
entity_value = entities[i]["value"]
if (type(entity_value) == unicode and type(entity_synonyms) == unicode and
        entity_value.lower() in entity_synonyms):
    entities[i]["value"] = entity_synonyms[entity_value]
Member

why not entity_value.lower()?

plauto (Contributor, Author)

Sometimes the entity can be an integer, which doesn't have a .lower() method, so we should check for that.

Member

I do understand the if, but the access into the dict should be entity_synonyms[entity_value.lower()], otherwise it might fail (entity_value.lower() in entity_synonyms might be true in your if statement while entity_value in entity_synonyms is false at the same time).

Concerning entities with numbers as values:

  • is there a reason why we need to support non-string values for entities? @amn41
  • if we do support it, do we have an example for that in the test data? There should definitely be a test for that (especially with this synonym logic); a sketch of such a test follows below.
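
For illustration, such a test could look like the following sketch; the fixture data and the replace_synonyms helper discussed in this thread are assumed:

    # Hypothetical test: a numeric entity value must not break synonym replacement.
    def test_replace_synonyms_with_numeric_value():
        entities = [{"value": 3, "entity": "fromPrice"},
                    {"value": "NYC", "entity": "city"}]
        entity_synonyms = {"nyc": "New York City", "3": "three"}
        Interpreter.replace_synonyms(entities, entity_synonyms)
        assert entities[0]["value"] == "three"
        assert entities[1]["value"] == "New York City"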

Contributor

I can't think of a reason why we should support entities other than strings. We should really add a check for this in the validate method of the TrainingData class.
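
Such a check could be a few lines in TrainingData.validate; a sketch under the assumption that entity examples are stored as shown in the diffs above:

    # Hypothetical validation: reject entity values that are not strings.
    def validate(self):
        for example in self.entity_examples:
            for entity in example.get("entities", []):
                if not isinstance(entity["value"], basestring):
                    raise ValueError(
                        "entity value {0!r} in example {1!r} must be a string".format(
                            entity["value"], example["text"]))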

plauto (Contributor, Author) commented Feb 15, 2017

Technically our entities could all be strings, even if the user marks something like this as an entity:

          {
            "start": 10, 
            "end": 11, 
            "value": "3", 
            "entity": "fromPrice"
          }

but I think that the tokenizer itself could also produce entity values that are not unicode.
So we should definitely add a check to the validate method, but I don't think you can avoid the unicode check unless you iterate over the entities and "unicodify" (sorry for the horrible term) all the entity values.

plauto (Contributor, Author)

@tmbo @amn41
Maybe we could write something like this:

    @staticmethod
    def replace_synonyms(entities, entity_synonyms):
        for i in range(len(entities)):
            entity_value = unicode(entities[i]["value"])
            if entity_value.lower() in entity_synonyms:
                entities[i]["value"] = entity_synonyms[entity_value.lower()]

amn41 (Contributor) commented Feb 16, 2017

Wait, how could the tokenizer produce anything other than a list of unicode objects?

src/__init__.py Outdated

__version__ = get_distribution('rasa_nlu').version


class Interpreter(object):
    def parse(self, text):
        raise NotImplementedError()

    @staticmethod
    def load_synonyms(entity_synonyms):
Member

entity_synonyms_file maybe? In the next function it's named the same way but represents a map, whereas this is supposed to be a path.

plauto (Contributor, Author)

Yes! That's definitely a better name

src/__init__.py Outdated
def replace_synonyms(entities, entity_synonyms):
    for i in range(len(entities)):
        entity_value = entities[i]["value"]
        if (type(entity_value) == unicode and type(entity_synonyms) == unicode and
Member

I am not sure I understand this: why is this logic so different from the one used when creating the dict in https://github.com/golastmile/rasa_nlu/pull/151/files#diff-a3ee8ed9ecce93718fad630ad21f9376R32m ?

Member

This somewhat also belongs to your comment 5 lines below.

I don't understand why at this location we only replace values where the entity value is a string, but in the referenced collection of synonyms we also collect synonyms for non-string entities.


@plauto plauto merged commit 2aae312 into master Feb 22, 2017
@plauto plauto deleted the entity-synonym branch February 22, 2017 12:22
tmbo added a commit that referenced this pull request Feb 23, 2017
vcidst added a commit that referenced this pull request Feb 22, 2024
* do not unpack json payload if data key is not present

* add room arg to else branch

* prepared release of version 3.7.3.dev1 (#151)

* Prepare-release-3.7.3.dev2 (#164)

* prepared release of version 3.7.3.dev2

* allow dev releases without changelogs

* add changelog entry

---------

Co-authored-by: Shailendra Paliwal <hello@shailendra.me>