Conversation
Nice find. Is it much quicker then?
```python
nlp = spacy.blank("en")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
```
Nice find. I think this needs `spacy-lookups-data` to be installed (which is probably installed automatically with `spacy[lookups]`).
Just did a little experiment with a blank virtualenv, and it seems that as long as `spacy[lookups]` is installed, `import spacy` in the script is enough for this to work. Or am I confusing something?
Can you try what happens if the `en_core_web_sm` model is not installed? I think an IOError is still thrown, so you should keep the exception handling that @aCampello had.
seems to work without it being installed:
```python
>>> import spacy
>>> nlp=spacy.load('en_core_web_sm')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/gallaghe/Code/nutrition-labels/build/virtualenv/lib/python3.7/site-packages/spacy/__init__.py", line 47, in load
    return util.load_model(name, disable=disable, exclude=exclude, config=config)
  File "/Users/gallaghe/Code/nutrition-labels/build/virtualenv/lib/python3.7/site-packages/spacy/util.py", line 329, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
>>> nlp = spacy.blank("en")
>>> nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
<spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x15ad7ca00>
>>> nlp.initialize()
<thinc.optimizers.Optimizer object at 0x15aba8050>
>>> X = ['the cats sat on the sitting room floors enjoying the sun']
>>> [[token.lemma_.lower() for token in doc] for doc in nlp.pipe(X)]
[['the', 'cat', 'sit', 'on', 'the', 'sit', 'room', 'floor', 'enjoy', 'the', 'sun']]
```
ok cool, just realised that you initialise a blank model, good 👍 thanks for checking
Yeah, that's actually a big advantage. You just need to install the lookups, not the entire model! Much simpler code too.
Running on 1000 grant descriptions the previous method took around 14 secs (when the tagger is not disabled) and this new method takes 2 seconds.
⚡
Pipeline seems to fail with a weird error.
@aCampello :( seems to be the same one as you got when you opened https://github.com/wellcometrust/WellcomeML/issues/270. Did you do anything to get your PR to pass in the end? As @nsorros said "Some non deterministic test fail here showing up again."
Don't know, I re-ran and it passed! I assume some memory error again.
Let me have a look. Does
@aCampello it's still running locally for 3.8, but has passed for 3.7
I re-triggered your pipeline. Let's see.
Blocking so we check whether the exception handling is still needed
Description
Fixes https://github.com/wellcometrust/WellcomeML/issues/271 by using a blank spacy model.
Running on 1000 grant descriptions: the previous method took 33 secs (when the tagger is disabled) and 14 secs (when the tagger is not disabled). This new method takes 2 secs.
I also noticed some unusual behaviour where other methods don't seem to actually lemmatize the text.
Original method (when the model is already downloaded)
gives loads of warnings, one for every token, of the form "spacy WARNING: [W108] The rule-based lemmatizer did not find POS annotation for the token 'sun'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'" (allenai seem to have had a similar issue with it)
and the output is:
I realise this had a bug in it where the tagger shouldn't have been disabled, i.e.
does seem to lemmatize correctly.
I also tried the following method:
which also gave:
The final method, which I've committed, is:
gives:
Checklist