Refresh mitie sklearn and add tests #152
Conversation
Cool. Could you please also add a test for running the trained model?
I have some concerns about the code duplication. How about splitting the mitie and the spacy trainer into two classes (e.g. SpacySklearnIntentTrainer and SpacySklearnEntityTrainer) and letting the current SpacySklearnTrainer inherit from both? The mitie_sklearn trainer could then pick the functionality it needs from spacy and mitie.
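A minimal sketch of the suggested split, assuming a mixin-style design (the class and attribute names here are illustrative placeholders, not the real implementations):

```python
# Hypothetical mixins: intent and entity training live in separate classes,
# and the combined trainer inherits from both.

class SpacySklearnIntentTrainer(object):
    def train_intent_classifier(self, intent_examples):
        # placeholder; the real version would fit an sklearn classifier
        self.intent_classifier = "intent-model"


class SpacySklearnEntityTrainer(object):
    def train_entity_extractor(self, entity_examples):
        # placeholder; the real version would train a spaCy NER model
        self.entity_extractor = "entity-model"


class SpacySklearnTrainer(SpacySklearnIntentTrainer, SpacySklearnEntityTrainer):
    def train(self, data):
        self.train_intent_classifier(data.get("intent_examples", []))
        self.train_entity_extractor(data.get("entity_examples", []))
```

A mitie_sklearn trainer could then cherry-pick, e.g. inherit the sklearn intent mixin while supplying its own MITIE-based entity training.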
_pytest/test_train.py (Outdated)

    def test_train_mitie_sklearn():
        # basic conf
        _config = {
            'write': os.path.join(os.getcwd(), "rasa_nlu_logs.json"),
should use tempfile
config_mitie.json (Outdated)

    @@ -1,5 +1,5 @@
     {
    -    "backend": "mitie",
    +    "backend": "mitie_sklearn",
is this intended?
I think so. Technically this backend is meant to be faster, so we should consider making it the default.
@amn41 what do you think?
ah I see. Let's make another config_mitie_sklearn.json. If people want to start out by installing a single dependency, that should work. We should probably throw some warnings if training is started with the MITIE backend and > 5 intents and > 50 intent examples.
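A hypothetical sketch of that guard — warn when the plain MITIE backend is used on a training set large enough that mitie_sklearn would likely be much faster. The function name and thresholds follow the comment above; nothing here is from the actual codebase:

```python
import warnings

# Thresholds suggested in the review discussion
MAX_MITIE_INTENTS = 5
MAX_MITIE_INTENT_EXAMPLES = 50


def check_mitie_training_size(backend, intents, intent_examples):
    """Warn if the plain MITIE backend is used on a large training set."""
    if backend == "mitie" and (len(intents) > MAX_MITIE_INTENTS or
                               len(intent_examples) > MAX_MITIE_INTENT_EXAMPLES):
        warnings.warn("Training the plain MITIE backend on this many intents/"
                      "examples can be very slow; consider 'mitie_sklearn'.")
```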
    -    def __init__(self, metadata):
    -        self.extractor = named_entity_extractor(metadata["entity_extractor"])  # , metadata["feature_extractor"])
    -        self.classifier = text_categorizer(metadata["intent_classifier"])  # , metadata["feature_extractor"])
    +    def __init__(self, intent_classifier=None, entity_extractor=None, feature_extractor=None, **kwargs):
naming (I know it's the same in the MITIEInterpreter, but let's try to improve the code base 😅): all three of these names should indicate that they are strings representing a path, not the actual classifier / extractor.
we can append _file at the end.
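A sketch of the rename, assuming a `_file` suffix as agreed above (class name and body are illustrative; the real constructor would hand the paths to mitie's loaders, e.g. `named_entity_extractor(entity_extractor_file)`):

```python
class MITIESklearnInterpreter(object):
    # Names make explicit that the arguments are file paths, not model objects.
    def __init__(self, intent_classifier_file=None, entity_extractor_file=None,
                 feature_extractor_file=None, **kwargs):
        self.intent_classifier_file = intent_classifier_file
        self.entity_extractor_file = entity_extractor_file
        self.feature_extractor_file = feature_extractor_file
```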
        return label

        """Returns the most likely intent and its probability for the input text.

        :param text: text to classify
there is no param text
yeah tokens was used in lieu of text. Now it's fixed. Thanks! :)
        :param text: text to classify
        :return: tuple of most likely intent name and its probability"""
        if self.classifier:
            X = self.featurizer.create_bow_vecs(tokens)
please add a comment why tokens are passed to this function although the function expects sentences.
tokens aren't passed. The function expects sentences and we actually pass those! You will see it fixed in the next commit
        self.entity_extractor = self.train_entity_extractor(data.entity_examples)
        self.train_intent_classifier(data.intent_examples, test_split_size)

        num_entity_examples = len([e for e in data.entity_examples if len(e["entities"]) > 0])
please add this logic as a method of TrainingData
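Moving the counting logic onto TrainingData could look roughly like this (the class shape here is a minimal stand-in, not the real TrainingData):

```python
class TrainingData(object):
    def __init__(self, entity_examples):
        self.entity_examples = entity_examples

    def num_entity_examples(self):
        # Count only examples that contain at least one annotated entity
        return len([e for e in self.entity_examples if len(e["entities"]) > 0])
```

The trainer would then call `data.num_entity_examples()` instead of repeating the list comprehension.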
    -    def train(self, data):
    +    def train(self, data, test_split_size=0.1):
I think it's time to move this function to the Trainer and remove the duplication in every one of the three trainers, what do you think?
yep that's cleaner. We should add abstract methods for the train_intent_classifier & train_entity_extractor to the base class as well.
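The agreed refactor could be sketched like this, assuming a shared `train()` on the base class with abstract hooks per backend (signatures follow the discussion; the dummy subclass is only for illustration):

```python
import abc


class Trainer(abc.ABC):
    # Shared train() lives once on the base class; backends implement the hooks.
    def train(self, data, test_split_size=0.1):
        self.entity_extractor = self.train_entity_extractor(data["entity_examples"])
        self.intent_classifier = self.train_intent_classifier(
            data["intent_examples"], test_split_size)

    @abc.abstractmethod
    def train_entity_extractor(self, entity_examples):
        raise NotImplementedError

    @abc.abstractmethod
    def train_intent_classifier(self, intent_examples, test_split_size):
        raise NotImplementedError


class DummyTrainer(Trainer):
    def train_entity_extractor(self, entity_examples):
        return "ner"

    def train_intent_classifier(self, intent_examples, test_split_size):
        return "clf"
```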
no problem! I also think it will be cleaner eventually
    @@ -27,46 +40,72 @@ def start_and_end(self, text_tokens, entity_tokens):
            start, end = locs[0], locs[0] + len(entity_tokens)
            return start, end

        @classmethod
        def find_entity(cls, ent, text):
This is a duplication of the code in MITIETrainer
        val_tokens = tokenize(_slice)
        end = start + len(val_tokens)
        return start, end

    def train_entity_extractor(self, entity_examples):
This is a duplication of the code in MITIETrainer
        for example in intent_examples:
            tokens = tokenize(example["text"])
            trainer.add_labeled_text(tokens, example["intent"])

    def train_intent_classifier(self, intent_examples, test_split_size=0.1):
This is a duplication of the code in SpacySklearnTrainer
Tom, re: your comment. I agree it would be cleaner to split this up & recombine. But since there are only two cases (plain mitie and mitie_sklearn), it feels like overengineering. Maybe there's a simpler (perhaps less idiomatic) way to reduce the code duplication?
@tmbo how about two utils files then? Like one sklearn_trainer_utils and one mitie_trainer_utils, each of them with an entity and an intent classifier. Then you would just import what you need.
Yes that sounds good.
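The utils-module idea could look roughly like this sketch — module names come from the discussion, but the function bodies are placeholders; in the repo the two sections would live in separate files:

```python
# --- mitie_trainer_utils (would be its own module) ---
def train_entity_extractor(entity_examples):
    # placeholder; would wrap mitie's ner_trainer here
    return "mitie-ner"


# --- sklearn_trainer_utils (would be its own module) ---
def train_intent_classifier(intent_examples, test_split_size=0.1):
    # placeholder; would wrap SklearnIntentClassifier here
    return "sklearn-clf"


# A mitie_sklearn trainer then just composes the two imports:
class MITIESklearnTrainer(object):
    def train(self, data):
        self.entity_extractor = train_entity_extractor(data["entity_examples"])
        self.intent_classifier = train_intent_classifier(data["intent_examples"])
```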
Looks really good! Added one comment; after that's fixed it's a 👍
        ner = trainer.train()
        return ner

    def train_intent_classifier(self, intent_examples, test_split_size=0.1):
        self.intent_classifier = SklearnIntentClassifier(max_num_threads=self.max_num_threads)
Is there a reason not to use sklearn_trainer_utils.train_intent_classifier here?
…-environment-variable Enable an env variable to define whether llm prompt is logged at INFO level
Fixes #137
Might need to perform a further update once #151 is accepted in order to add support for the synonyms within the interpreter