This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

spacy 3.0 warnings about lemmatization and POS #5036

Closed
nelson-liu opened this issue Mar 4, 2021 · 6 comments · Fixed by #5066
@nelson-liu
Contributor

I'm training a model (https://github.com/allenai/allennlp-models/blob/main/training_config/pair_classification/mnli_roberta.jsonnet) with allennlp 2.1.0, using SpaCy 3.

There are a bunch of warnings that show up (I'm not sure if these were here before, but I noticed them now because my log files have become massive):

[W108] The rule-based lemmatizer did not find POS annotation for the token 'Tommy'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'hesitated'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token '.'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'Tommy'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'hesitated'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'for'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'a'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'short'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
...

I think this is because allennlp.common.util.get_spacy_model has POS tagging off by default, but doesn't disable lemmatization by default.

Not sure what y'all think is the best way to solve this...we could add a lemmatization argument to get_spacy_model that defaults to false? That would change the defaults from previous versions, though.
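To make the suggestion concrete, here's a hypothetical sketch of how such a flag could feed into the disable list passed to spacy.load. The function name, parameter names, and defaults are assumptions for illustration, not AllenNLP's actual API:

```python
# Hypothetical sketch: a get_spacy_model-style helper that disables the
# lemmatizer by default. All names/defaults here are assumptions.

def build_disable_list(pos_tags: bool = False, parse: bool = False,
                       ner: bool = False, lemmatization: bool = False) -> list:
    """Return the spaCy pipeline components to pass to spacy.load(..., disable=...)."""
    disable = []
    if not pos_tags:
        # In spaCy 3, POS comes from 'tagger' + 'attribute_ruler'.
        disable.extend(["tagger", "attribute_ruler"])
    if not parse:
        disable.append("parser")
    if not ner:
        disable.append("ner")
    if not lemmatization:
        # With the lemmatizer disabled, W108 is never emitted.
        disable.append("lemmatizer")
    return disable

print(build_disable_list())
```

With everything off by default, the lemmatizer is disabled along with the tagger, so the warning can't fire; callers who want lemmas would opt in with lemmatization=True (and presumably pos_tags=True).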

@nelson-liu nelson-liu added the bug label Mar 4, 2021
@nelson-liu
Contributor Author

nelson-liu commented Mar 5, 2021

I did a bit of digging, and I feel like this issue is really hard to solve in a way that's perfectly backwards-compatible / preserves existing behavior.

So, before SpaCy 3.0, a pipeline without a tagger but with a lemmatizer would use a lookup-table based lemmatizer (e.g., see below):

In [1]: import spacy

In [2]: spacy.__version__
Out[2]: '2.3.5'

In [3]: nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser', 'tagger'])

In [4]: sent = "Manchester United is looking to sign a forward for $90 million"

In [5]: nlp(sent, disable=['ner', 'parser', 'tagger'])[3].lemma_
Out[5]: 'look'

In [6]: nlp(sent, disable=['ner', 'parser', 'tagger'])[3].pos_
Out[6]: ''

So in this case, looking is lemmatized to look.

However, in 3.0, this no longer works. If you try to lemmatize a token that doesn't have a POS tag, SpaCy emits the warning originally described in this issue and then just returns the lowercase version of the token, which is hardly useful lemmatization. See https://github.com/explosion/spaCy/blob/master/spacy/pipeline/lemmatizer.py#L183-L186 .
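The fallback at the linked lines boils down to something like the following simplified pure-Python sketch (Token here is a stand-in class, not spaCy's; the real method also consults per-POS rule tables):

```python
import warnings

class Token:
    """Stand-in for a spaCy Token: just a surface form and a POS string."""
    def __init__(self, text: str, pos: str = ""):
        self.text = text
        self.pos_ = pos

def rule_lemmatize(token: Token) -> list:
    """Simplified sketch of spaCy 3's rule-based lemmatizer fallback."""
    if not token.pos_:
        warnings.warn(
            f"[W108] The rule-based lemmatizer did not find POS annotation "
            f"for the token '{token.text}'."
        )
        # Fallback: just lowercase the surface form -- no real lemmatization.
        return [token.text.lower()]
    # ...the real implementation looks up suffix rules and exception
    # tables keyed on token.pos_ here...
    return [token.text.lower()]
```

So without a tagger upstream, every single token takes the warning-and-lowercase path.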

So functionally, lemmas are broken by default and noisy when using SpaCy 3.0 + AllenNLP master. There are a few paths forward I can think of:

  • Force the tagger to always be on if spacy > 3 (preserves the behavior of having lemmas in tokens, doesn't preserve the behavior of not running the tagger)
  • Turn off the lemmatizer if spacy > 3 (preserves the behavior of not running the tagger, but doesn't preserve the behavior of having lemmas in the token)

Anyway, feels like there's a fundamental tradeoff to be made here because in spacy 3, you need tagging for lemmatization, and the old default in spacy 2 was to do lemmatization without tagging.
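The two options above could be sketched as a version-dependent disable list. This is purely illustrative (the function name and structure are assumptions); only the component names come from spaCy itself:

```python
# Hypothetical sketch of the two proposed options, keyed on the
# installed spaCy major version.

def components_to_disable(spacy_major: int, want_tagger: bool) -> list:
    disable = ["ner", "parser"]
    if spacy_major >= 3:
        if want_tagger:
            # Option 1: keep 'tagger' + 'attribute_ruler' on so the
            # rule-based lemmatizer has POS annotation (lemmas work,
            # but the tagger now always runs).
            pass
        else:
            # Option 2: preserve the no-tagger behavior, but drop the
            # lemmatizer too so W108 is never emitted (no lemmas).
            disable += ["tagger", "attribute_ruler", "lemmatizer"]
    elif not want_tagger:
        # spaCy 2: lookup-table lemmatization works without a tagger,
        # so disabling the tagger alone was safe.
        disable.append("tagger")
    return disable
```

The spaCy 2 branch is exactly why this is a real tradeoff: the old default ("tagger off, lemmas still present") has no equivalent configuration in spaCy 3.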

@nelson-liu
Contributor Author

(personally, as someone who doesn't use the lemmatizer / lemma information, I'd be in favor of just disabling it by default when spacy > 3. If you're relying on the lemmas, you shouldn't be expecting them to be implicitly set by spacy anyway, and you can go turn it on in your tokenizer.)

@leo-liuzy
Contributor

leo-liuzy commented Mar 5, 2021

Thanks for the digging! Really helpful in elaborating on the issue!
When I am in a similar situation (not using spaCy, but conceptually similar), I always prefer more information produced over less. Do you see any downside to doing so? Is there any speed issue if we always keep tagging on?

@nelson-liu
Contributor Author

In this case, a log message gets printed for literally every token (note: not every type) in my dataset, which pollutes stderr and stdout. I also save both streams to files, and both of those files are now 1 GB+ simply because of this one warning (when they should be in the hundreds of KBs).
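Back-of-the-envelope, the numbers check out: each W108 line above is roughly 200 bytes, so a dataset on the order of millions of tokens easily produces gigabytes of warnings. (The token count below is an assumed illustrative figure, not the actual size of MNLI.)

```python
# Rough log-size estimate: one W108 warning line per token.
bytes_per_warning = 200        # approximate length of one W108 line
tokens = 5_000_000             # assumed dataset size, for illustration
log_bytes = bytes_per_warning * tokens
print(f"~{log_bytes / 1e9:.1f} GB of warnings")
```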

@leo-liuzy
Contributor

I thought the warning is generated because the POS tagger is not on? And I was thinking that always keeping the tagger on wouldn't increase the computation much.

@dirkgr
Member

dirkgr commented Mar 12, 2021

@leo-liuzy will make a PR that turns on tagging by default.
