
Refresh mitie sklearn and add tests #152

Merged (20 commits into master from mitie-refresh, Feb 21, 2017)

Conversation

@plauto (Contributor) commented Feb 12, 2017

Fixes #137

Might need to perform a further update once #151 is accepted, in order to add support for the synonyms within the interpreter.

@amn41 (Contributor) commented Feb 14, 2017

Cool. Could you please also add a test for running the trained model?

@tmbo (Member) left a comment

I have some concerns about the code duplication. How about splitting the mitie and the spacy trainer into two classes (e.g. SpacySklearnIntentTrainer and SpacySklearnEntityTrainer) and letting the current SpacySklearnTrainer inherit from both? The mitie_sklearn trainer could then pick the functionality it needs from spacy and mitie.
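The split proposed above could look roughly like this. This is a sketch only: the class names come from the comment, but the method bodies are stand-ins, not Rasa's real training code.

```python
# Hypothetical sketch of the proposed trainer split.

class SpacySklearnIntentTrainer(object):
    def train_intent_classifier(self, intent_examples):
        return "sklearn-intent-model"      # stand-in for the real logic

class SpacySklearnEntityTrainer(object):
    def train_entity_extractor(self, entity_examples):
        return "spacy-entity-model"        # stand-in for the real logic

class SpacySklearnTrainer(SpacySklearnIntentTrainer, SpacySklearnEntityTrainer):
    """Inherits both halves, as suggested in the review."""
    def train(self, data):
        self.intent_classifier = self.train_intent_classifier(data)
        self.entity_extractor = self.train_entity_extractor(data)

class MitieSklearnTrainer(SpacySklearnIntentTrainer):
    """Picks only the sklearn intent training; entities come from MITIE."""
    def train_entity_extractor(self, entity_examples):
        return "mitie-entity-model"        # stand-in for the real logic
```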

def test_train_mitie_sklearn():
    # basic conf
    _config = {
        'write': os.path.join(os.getcwd(), "rasa_nlu_logs.json"),
Member:

should use tempfile
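A minimal sketch of the tempfile-based version of that test config. The helper name is illustrative, not from the PR; the point is that the log path lands in a scratch directory rather than `os.getcwd()`, so test runs leave no stray files behind.

```python
import os
import tempfile

def make_test_config():
    # per-test scratch directory; caller (or the test framework) cleans up
    log_dir = tempfile.mkdtemp()
    return {
        'write': os.path.join(log_dir, "rasa_nlu_logs.json"),
    }
```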

@@ -1,5 +1,5 @@
{
"backend": "mitie",
"backend": "mitie_sklearn",
Member:

is this intended?

plauto (Contributor, author):

I think so. Technically this backend is meant to be faster, so we should consider making it the default.
@amn41 what do you think?

Contributor:

Ah, I see. Let's make another config_mitie_sklearn.json. If people want to start out by installing a single dependency, that should work. We should probably throw some warnings if training is started with the MITIE backend and more than 5 intents and more than 50 intent examples.
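A hedged sketch of that warning. The thresholds (more than 5 intents, more than 50 examples) come straight from the comment above; the function name and the example structure (`{"intent": ...}` dicts) are assumptions about where this check would live.

```python
import warnings

def warn_if_mitie_may_be_slow(backend, intent_examples):
    # count distinct intent labels in the training examples
    num_intents = len({e["intent"] for e in intent_examples})
    if backend == "mitie" and num_intents > 5 and len(intent_examples) > 50:
        warnings.warn(
            "Training the plain MITIE backend on %d intents and %d examples "
            "may be very slow; consider the mitie_sklearn backend." %
            (num_intents, len(intent_examples)))
```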

-def __init__(self, metadata):
-    self.extractor = named_entity_extractor(metadata["entity_extractor"])  # , metadata["feature_extractor"])
-    self.classifier = text_categorizer(metadata["intent_classifier"])  # , metadata["feature_extractor"])
+def __init__(self, intent_classifier=None, entity_extractor=None, feature_extractor=None, **kwargs):
Member:

Naming (I know it's the same in the MITIEInterpreter, but let's try to improve the code base 😅): all three of these names should indicate that they are strings representing a path, not the actual classifier / extractor.

plauto (Contributor, author):

We can append _file at the end.
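The agreed rename could look like this. The class name and attribute assignments are assumed from context; the point is only that the `_file` suffix signals each argument is a path to a serialized model, not the model object itself.

```python
class MITIESklearnInterpreter(object):  # class name assumed from context
    def __init__(self, intent_classifier_file=None,
                 entity_extractor_file=None,
                 feature_extractor_file=None, **kwargs):
        # each argument is a filesystem path, as the suffix now indicates
        self.intent_classifier_file = intent_classifier_file
        self.entity_extractor_file = entity_extractor_file
        self.feature_extractor_file = feature_extractor_file
```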

return label
"""Returns the most likely intent and its probability for the input text.

:param text: text to classify
Member:

there is no param text

plauto (Contributor, author) commented Feb 15, 2017:

Yeah, tokens was used in lieu of text. Now it's fixed. Thanks! :)

    :param text: text to classify
    :return: tuple of most likely intent name and its probability"""
    if self.classifier:
        X = self.featurizer.create_bow_vecs(tokens)
Member:

Please add a comment explaining why tokens are passed to this function although the function expects sentences.

plauto (Contributor, author):

Tokens aren't passed: the function expects sentences, and we actually pass those! You will see it fixed in the next commit.

        self.entity_extractor = self.train_entity_extractor(data.entity_examples)
        self.train_intent_classifier(data.intent_examples, test_split_size)

        num_entity_examples = len([e for e in data.entity_examples if len(e["entities"]) > 0])
Member:

Please add this logic as a method of TrainingData.
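Moving that counting logic onto TrainingData could look like the sketch below. Only the fields used here are modeled; the rest of the class is elided, and the method name is an assumption.

```python
class TrainingData(object):
    def __init__(self, intent_examples, entity_examples):
        self.intent_examples = intent_examples
        self.entity_examples = entity_examples

    def num_entity_examples(self):
        """Number of examples containing at least one labeled entity."""
        return len([e for e in self.entity_examples
                    if len(e["entities"]) > 0])
```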


-def train(self, data):
+def train(self, data, test_split_size=0.1):
Member:

I think it's time to move this function to the Trainer and remove the duplication in every one of the three trainers. What do you think?

Contributor:

Yep, that's cleaner. We should add abstract methods for train_intent_classifier and train_entity_extractor to the base class as well.
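The proposal above amounts to something like this: `train()` lives once on the base Trainer, and the two backend-specific steps become abstract methods. The `abc`-based shape is an assumption about how the base class would be written, not the code actually merged.

```python
import abc

class Trainer(abc.ABC):
    def train(self, data, test_split_size=0.1):
        # shared training flow, previously duplicated in each trainer
        self.entity_extractor = self.train_entity_extractor(data.entity_examples)
        self.intent_classifier = self.train_intent_classifier(
            data.intent_examples, test_split_size)

    @abc.abstractmethod
    def train_entity_extractor(self, entity_examples):
        raise NotImplementedError

    @abc.abstractmethod
    def train_intent_classifier(self, intent_examples, test_split_size):
        raise NotImplementedError
```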

plauto (Contributor, author) commented Feb 15, 2017:

No problem! I also think it will be cleaner eventually.

@@ -27,46 +40,72 @@ def start_and_end(self, text_tokens, entity_tokens):
start, end = locs[0], locs[0] + len(entity_tokens)
return start, end

@classmethod
def find_entity(cls, ent, text):
Member:

This is a duplication of the code in MITIETrainer

val_tokens = tokenize(_slice)
end = start + len(val_tokens)
return start, end

def train_entity_extractor(self, entity_examples):
Member:

This is a duplication of the code in MITIETrainer

for example in intent_examples:
tokens = tokenize(example["text"])
trainer.add_labeled_text(tokens, example["intent"])
def train_intent_classifier(self, intent_examples, test_split_size=0.1):
Member:

This is a duplication of the code in SpacySklearnTrainer

@amn41 (Contributor) commented Feb 16, 2017

Tom, re: your comment — I agree it would be cleaner to split this up and recombine. But since there are only two cases (plain mitie and mitie_sklearn), it feels like overengineering. Maybe there's a simpler (perhaps less idiomatic) way to reduce the code duplication?

@plauto (Contributor, author) commented Feb 16, 2017

@tmbo how about two utils files then? E.g. sklearn_trainer_utils and mitie_trainer_utils, each with an entity and an intent training function; then each trainer just imports what it needs.

@tmbo (Member) commented Feb 16, 2017

Yes that sounds good.
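The utils-file approach the two just agreed on can be sketched as plain functions that each trainer imports. Everything below is illustrative: the function lives in one file here for brevity, and the signature and example structure are assumptions, not the merged code.

```python
def train_intent_classifier(intent_examples, classifier):
    """Would live in sklearn_trainer_utils.py and be shared by the
    spacy_sklearn and mitie_sklearn trainers."""
    texts = [e["text"] for e in intent_examples]
    labels = [e["intent"] for e in intent_examples]
    classifier.fit(texts, labels)
    return classifier

class MitieSklearnTrainer(object):
    """Would import the shared helper instead of duplicating it."""
    def train(self, data, classifier):
        self.intent_classifier = train_intent_classifier(
            data.intent_examples, classifier)
```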

@tmbo (Member) left a comment

Looks really good! Added one comment; after that's fixed it's a 👍

ner = trainer.train()
return ner
def train_intent_classifier(self, intent_examples, test_split_size=0.1):
self.intent_classifier = SklearnIntentClassifier(max_num_threads=self.max_num_threads)
Member:

Is there a reason not to use sklearn_trainer_utils.train_intent_classifier here?

@tmbo tmbo mentioned this pull request Feb 20, 2017
@plauto plauto merged commit debf0de into master Feb 21, 2017
@plauto plauto deleted the mitie-refresh branch February 21, 2017 21:29
tmbo added a commit that referenced this pull request Feb 23, 2017
tmbo added a commit that referenced this pull request Feb 23, 2017
taytzehao pushed a commit to taytzehao/rasa that referenced this pull request Jul 14, 2023
vcidst pushed a commit that referenced this pull request Jan 23, 2024
…-environment-variable

Enable an env variable to define whether the LLM prompt is logged at INFO level