This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

spacy 3.0 warnings about lemmatization and POS #5036

Closed
nelson-liu opened this issue Mar 4, 2021 · 6 comments · Fixed by #5066
@nelson-liu
Contributor

I'm training a model (https://github.com/allenai/allennlp-models/blob/main/training_config/pair_classification/mnli_roberta.jsonnet) with allennlp 2.1.0, using SpaCy 3.

There are a bunch of warnings that show up (I'm not sure if these were here before, but I noticed them now because my log files have become massive):

[W108] The rule-based lemmatizer did not find POS annotation for the token 'Tommy'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'hesitated'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token '.'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'Tommy'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'hesitated'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'for'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'a'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'short'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
...

I think this is because allennlp.common.util.get_spacy_model has POS tagging off by default, but doesn't disable lemmatization by default.

Not sure what y'all think is the best way to solve this...we could add a lemmatization argument to get_spacy_model that defaults to false? That would change the defaults from previous versions, though.
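To make the suggestion concrete, here's a hypothetical sketch of how such a flag could feed into the disable list passed to spacy.load. The function name, parameter names, and defaults are assumptions for illustration, not AllenNLP's actual API:

```python
# Hypothetical sketch: a get_spacy_model-style helper that disables the
# lemmatizer by default. All names/defaults here are assumptions.

def build_disable_list(pos_tags: bool = False, parse: bool = False,
                       ner: bool = False, lemmatization: bool = False) -> list:
    """Return the spaCy pipeline components to pass to spacy.load(..., disable=...)."""
    disable = []
    if not pos_tags:
        # In spaCy 3, POS comes from 'tagger' + 'attribute_ruler'.
        disable.extend(["tagger", "attribute_ruler"])
    if not parse:
        disable.append("parser")
    if not ner:
        disable.append("ner")
    if not lemmatization:
        # With the lemmatizer disabled, W108 is never emitted.
        disable.append("lemmatizer")
    return disable

print(build_disable_list())
```

With everything off by default, the lemmatizer is disabled along with the tagger, so the warning can't fire; callers who want lemmas would opt in with lemmatization=True (and presumably pos_tags=True).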

@nelson-liu nelson-liu added the bug label Mar 4, 2021
@nelson-liu
Contributor Author

nelson-liu commented Mar 5, 2021

I did a bit of digging, and I feel like this issue is really hard to solve in a way that's perfectly backwards-compatible / preserves existing behavior.

So, before SpaCy 3.0, a pipeline without a tagger but with a lemmatizer would use a lookup-table based lemmatizer (e.g., see below):

In [1]: import spacy

In [2]: spacy.__version__
Out[2]: '2.3.5'

In [3]: nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser', 'tagger'])

In [4]: sent = "Manchester United is looking to sign a forward for $90 million"

In [5]: nlp(sent, disable=['ner', 'parser', 'tagger'])[3].lemma_
Out[5]: 'look'

In [6]: nlp(sent, disable=['ner', 'parser', 'tagger'])[3].pos_
Out[6]: ''

So in this case, looking is lemmatized to look.

However, in 3.0, this no longer works. If you try to lemmatize a token that doesn't have a POS tag, SpaCy emits the warning originally described in this issue and then just returns the lowercase version of the token, which is hardly useful lemmatization. See https://github.com/explosion/spaCy/blob/master/spacy/pipeline/lemmatizer.py#L183-L186 .
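The fallback at the linked lines boils down to something like the following simplified pure-Python sketch (Token here is a stand-in class, not spaCy's; the real method also consults per-POS rule tables):

```python
import warnings

class Token:
    """Stand-in for a spaCy Token: just a surface form and a POS string."""
    def __init__(self, text: str, pos: str = ""):
        self.text = text
        self.pos_ = pos

def rule_lemmatize(token: Token) -> list:
    """Simplified sketch of spaCy 3's rule-based lemmatizer fallback."""
    if not token.pos_:
        warnings.warn(
            f"[W108] The rule-based lemmatizer did not find POS annotation "
            f"for the token '{token.text}'."
        )
        # Fallback: just lowercase the surface form -- no real lemmatization.
        return [token.text.lower()]
    # ...the real implementation looks up suffix rules and exception
    # tables keyed on token.pos_ here...
    return [token.text.lower()]
```

So without a tagger upstream, every single token takes the warning-and-lowercase path.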

So functionally, lemmas are broken by default and noisy when using SpaCy 3.0 + AllenNLP master. There are a few paths forward I can think of:

  • Force the tagger to always be on if spacy > 3 (preserves the behavior of having lemmas in tokens, doesn't preserve the behavior of not running the tagger)
  • Turn off the lemmatizer if spacy > 3 (preserves the behavior of not running the tagger, but doesn't preserve the behavior of having lemmas in the token)

Anyway, feels like there's a fundamental tradeoff to be made here because in spacy 3, you need tagging for lemmatization, and the old default in spacy 2 was to do lemmatization without tagging.
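The two options above could be sketched as a version-dependent disable list. This is purely illustrative (the function name and structure are assumptions); only the component names come from spaCy itself:

```python
# Hypothetical sketch of the two proposed options, keyed on the
# installed spaCy major version.

def components_to_disable(spacy_major: int, want_tagger: bool) -> list:
    disable = ["ner", "parser"]
    if spacy_major >= 3:
        if want_tagger:
            # Option 1: keep 'tagger' + 'attribute_ruler' on so the
            # rule-based lemmatizer has POS annotation (lemmas work,
            # but the tagger now always runs).
            pass
        else:
            # Option 2: preserve the no-tagger behavior, but drop the
            # lemmatizer too so W108 is never emitted (no lemmas).
            disable += ["tagger", "attribute_ruler", "lemmatizer"]
    elif not want_tagger:
        # spaCy 2: lookup-table lemmatization works without a tagger,
        # so disabling the tagger alone was safe.
        disable.append("tagger")
    return disable
```

The spaCy 2 branch is exactly why this is a real tradeoff: the old default ("tagger off, lemmas still present") has no equivalent configuration in spaCy 3.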

@nelson-liu
Contributor Author

(personally, as someone who doesn't use the lemmatizer / lemma information, I'd be in favor of just disabling it by default when spacy > 3. If you're relying on the lemmas, you shouldn't be expecting them to be implicitly set by spacy anyway, and you can go turn it on in your tokenizer.)

@leo-liuzy
Contributor

leo-liuzy commented Mar 5, 2021

Thanks for the digging! Really helpful in elaborating on the issue!
When I am in a similar situation (not using spaCy, but conceptually similar), I always prefer more information produced over less. Do you see any downside to doing so? Is there any speed issue if we always keep tagging on?

@nelson-liu
Contributor Author

In this case, a log message gets printed for literally every token (note: not every type) in my dataset, which pollutes stderr and stdout. I also save both streams to files, and both of those files are now 1 GB+ simply because of this one warning (when they should be in the hundreds of KBs).
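Back-of-the-envelope, the numbers check out: each W108 line above is roughly 200 bytes, so a dataset on the order of millions of tokens easily produces gigabytes of warnings. (The token count below is an assumed illustrative figure, not the actual size of MNLI.)

```python
# Rough log-size estimate: one W108 warning line per token.
bytes_per_warning = 200        # approximate length of one W108 line
tokens = 5_000_000             # assumed dataset size, for illustration
log_bytes = bytes_per_warning * tokens
print(f"~{log_bytes / 1e9:.1f} GB of warnings")
```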

@leo-liuzy
Contributor

I thought the warning is generated because the POS tagger is not on? And I was thinking that always keeping the tagger on wouldn't increase the computation much.

@dirkgr
Member

dirkgr commented Mar 12, 2021

@leo-liuzy will make a PR that turns on tagging by default.
