Conversation
Nice find. Is it much quicker then?
```python
nlp = spacy.blank("en")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
```
Nice find. I think this needs `spacy-lookups-data` to be installed (which is probably installed automatically with `spacy[lookups]`).
Just did a little experiment with a blank virtualenv, and it seems that as long as `spacy[lookups]` is installed, `import spacy` in the script is enough for this to work. Or am I confusing something?
Can you try what happens if the `en_core_web_sm` model is not installed? I think an IOError is still thrown, so you should keep the exception handling that @aCampello had.
seems to work without it being installed:
```python
>>> import spacy
>>> nlp=spacy.load('en_core_web_sm')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/gallaghe/Code/nutrition-labels/build/virtualenv/lib/python3.7/site-packages/spacy/__init__.py", line 47, in load
    return util.load_model(name, disable=disable, exclude=exclude, config=config)
  File "/Users/gallaghe/Code/nutrition-labels/build/virtualenv/lib/python3.7/site-packages/spacy/util.py", line 329, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
>>> nlp = spacy.blank("en")
>>> nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
<spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x15ad7ca00>
>>> nlp.initialize()
<thinc.optimizers.Optimizer object at 0x15aba8050>
>>> X = ['the cats sat on the sitting room floors enjoying the sun']
>>> [[token.lemma_.lower() for token in doc] for doc in nlp.pipe(X)]
[['the', 'cat', 'sit', 'on', 'the', 'sit', 'room', 'floor', 'enjoy', 'the', 'sun']]
```
ok cool, just realised that you initialise a blank model, good 👍 thanks for checking
Yeah, that's actually a big advantage. You just need to install the lookups, not the entire model! Much simpler code too.
Running on 1000 grant descriptions the previous method took around 14 secs (when the tagger is not disabled) and this new method takes 2 seconds.
⚡
Pipeline seems to fail with a weird error.
@aCampello :( seems to be the same one as you got when you opened https://github.com/wellcometrust/WellcomeML/issues/270. Did you do anything to get your PR to pass in the end? As @nsorros said "Some non deterministic test fail here showing up again."
Don't know, I re-ran and it passed! I assume some memory error again.
Let me have a look. Does
@aCampello it's still running locally for 3.8, but has passed for 3.7
I re-triggered your pipeline. Let's see.
Blocking so we check whether the exception handling is still needed
Description
Fixes https://github.com/wellcometrust/WellcomeML/issues/271 by using a blank spacy model.
Running on 1000 grant descriptions: the previous method took 33 secs (when the tagger is disabled) and 14 secs (when the tagger is not disabled). This new method takes 2 secs.
I also noticed some unusual behaviour where other methods don't seem to actually lemmatize the text.
Original method (when the model is already downloaded)
gives loads of warnings, one for every token, of the form "spacy WARNING: [W108] The rule-based lemmatizer did not find POS annotation for the token 'sun'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'" (allenai seem to have had a similar issue with it)
and the output is:
I realise this had a bug in it where the tagger shouldn't have been disabled, i.e.
does seem to lemmatize correctly.
I also tried the following method:
which also gave:
The final method, which I've committed, is:
gives:
Checklist