I have seen issues #773 and #881. I have been trying to use spaCy to train an entity recognizer similar to those of bot APIs (e.g., Google API.ai, WIT.ai, ...). I have a function like this (the main loop is basically copied from the tutorial):
```python
import random

from spacy.gold import GoldParse
from spacy.pipeline import EntityRecognizer


def train_entities(nlp, path, n_iterations=5):
    nlp.entity = EntityRecognizer(nlp.vocab, entity_types=['GPE'])

    train_data = load_train_data(path)

    if train_data is None:
        return nlp.entity

    # Very much based on
    # https://spacy.io/docs/usage/entity-recognition#updating
    for itn in range(n_iterations):
        random.shuffle(train_data)
        for raw_text, entity_offsets in train_data:
            doc = nlp.make_doc(raw_text)
            gold = GoldParse(doc, entities=entity_offsets)

            nlp.tagger(doc)
            nlp.entity.update(doc, gold)

    nlp.entity.model.end_training()
    return nlp
```
where load_train_data() gives me training examples in the format of the tutorial. I get the entities by running processed_sentence = nlp(sentence) (where sentence is a unicode string) and then accessing processed_sentence.ents. I had initially tried with just a few examples, and it didn't work. Then I read this (in #773):
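For concreteness, here is a minimal sketch of the shape of that data and of how I check the entities afterwards (the sample sentences and offsets are made up for illustration; the real data comes from load_train_data()):

```python
# Hypothetical examples in the tutorial's format:
# (text, [(start_char, end_char, label), ...])
train_data = [
    (u'I live in Dermbach.', [(10, 18, 'GPE')]),
    (u'I moved from Erfurt to Weimar.', [(13, 19, 'GPE'), (23, 29, 'GPE')]),
]

# After training, this is how I inspect the predictions:
processed_sentence = nlp(u'Waldstraße, 10, Dermbach is where I live')
for ent in processed_sentence.ents:
    print(ent.text, ent.label_)
```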
> I get that people want to train on a few dozen sentences. I think people shouldn't want that.
So I thought I would try to overfit the model by training it on the same sentences for a large number of epochs and see what happens. I found lots of addresses at http://results.openaddresses.io/. I chose the addresses in Thüringen (Germany) and randomly picked 25000 of them. I want the entity recognizer to tag these as "GPE". Using 12 small sentence templates, I generated 23105 sentences from these addresses, with annotations giving the character offsets at which each GPE starts and ends (just like in the tutorial). There are fewer sentences than addresses because some templates require two addresses (they are like "I moved from {} to {}").
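Roughly, the sentence generation works like this (a simplified sketch; the templates and addresses below are placeholders, not the actual 12 templates or the OpenAddresses data):

```python
templates = [
    u'I live in {}.',
    u'I moved from {} to {}.',
]
addresses = [u'Waldstraße, 10, Dermbach', u'Erfurt', u'Weimar']  # placeholders

def make_example(template, addrs):
    """Fill a template with addresses and record their character offsets as GPE spans."""
    text = u''
    offsets = []
    parts = template.split(u'{}')
    for i, part in enumerate(parts):
        text += part
        if i < len(addrs):
            offsets.append((len(text), len(text) + len(addrs[i]), 'GPE'))
            text += addrs[i]
    return text, offsets

example = make_example(templates[1], [addresses[1], addresses[2]])
# (u'I moved from Erfurt to Weimar.', [(13, 19, 'GPE'), (23, 29, 'GPE')])
```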
Finally, I trained the entity recognizer (using the function above and this dataset) for 5, 20, 50 and 100 epochs. Still, it seems all this training didn't make any difference: when I try any of the sentences from my training set, I get the same results I used to get without any training. For example, the sentence
Waldstraße, 10, Dermbach is where I live
(which is one of the sentences in the training set)
still gives me these three entities (the printing format is just for my convenience)
(see issue #858 for the Unicode strangeness -- shouldn't be a problem here)

which are the same as what it would originally give me without any training.

Am I doing anything wrong? Is this training procedure somehow wrong? Any ideas? Should I try more epochs?

[Now I'll probably take a look at how RASA NLU does it... because if they manage to make it work, then I am probably making some silly mistake]

Your Environment

Sorry that this has been a bit unstable. The code in Thinc 6.5.0 is training properly for me -- I'm currently working on getting new models up for the next release of spaCy.

So, try either updating to the latest thinc, or at least remove the call to .end_training().
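Concretely, applied to the function from the top of this issue, that suggestion amounts to something like this (a sketch; it assumes the same load_train_data() as above, and updating thinc may still be needed):

```python
import random

from spacy.gold import GoldParse
from spacy.pipeline import EntityRecognizer


def train_entities(nlp, path, n_iterations=5):
    nlp.entity = EntityRecognizer(nlp.vocab, entity_types=['GPE'])

    train_data = load_train_data(path)
    if train_data is None:
        return nlp.entity

    for itn in range(n_iterations):
        random.shuffle(train_data)
        for raw_text, entity_offsets in train_data:
            doc = nlp.make_doc(raw_text)
            gold = GoldParse(doc, entities=entity_offsets)
            nlp.tagger(doc)
            nlp.entity.update(doc, gold)

    # nlp.entity.model.end_training()  # <- the call suggested for removal above
    return nlp
```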