spacy 3.0 warnings about lemmatization and POS #5036
Comments
I did a bit of digging, and I think this issue is really hard to solve in a way that's perfectly backwards-compatible and preserves existing behavior. Before spaCy 3.0, a pipeline with a lemmatizer but no tagger would use a lookup-table-based lemmatizer (e.g., see below):
However, in 3.0, this no longer works. If you try to lemmatize a token that doesn't have a POS tag, spaCy emits the warning originally described in this issue and then just returns the lowercased token, which is hardly useful lemmatization. See https://github.com/explosion/spaCy/blob/master/spacy/pipeline/lemmatizer.py#L183-L186 . So functionally, with spaCy 3.0 + AllenNLP master, lemmas are broken by default and the logs are noisy. There are a few paths forward I can think of:
Anyway, it feels like there's a fundamental tradeoff to be made here: in spaCy 3, you need tagging for lemmatization, whereas the old default in spaCy 2 was to lemmatize without tagging.
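To make the tradeoff concrete, here is a toy sketch of the two behaviors described above. It does not call spaCy at all: the table is hand-written for illustration, and both function names are hypothetical.

```python
# Toy illustration only: this table is hand-written, not spaCy's lookup data.
LOOKUP_TABLE = {"was": "be", "mice": "mouse", "running": "run"}


def lookup_lemma(text: str) -> str:
    """Table-based lemmatization: no POS tag needed, just a dictionary lookup."""
    return LOOKUP_TABLE.get(text.lower(), text)


def rule_lemma_without_pos(text: str, pos: str = "") -> str:
    """Mimics the fallback described above: with no POS tag, the rule-based
    path gives up and returns the lowercased token."""
    if pos == "":
        return text.lower()
    raise NotImplementedError("POS-aware rules are out of scope for this sketch")


tokens = ["Mice", "was", "running"]
table_lemmas = [lookup_lemma(t) for t in tokens]
fallback_lemmas = [rule_lemma_without_pos(t) for t in tokens]
print(table_lemmas)
print(fallback_lemmas)
```

The second list is just the lowercased input, which is why downstream code that reads lemmas effectively gets nothing useful when the tagger is off.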
(Personally, as someone who doesn't use the lemmatizer or lemma information, I'd be in favor of just disabling it by default when spaCy >= 3. If you're relying on the lemmas, you shouldn't be expecting them to be implicitly set by spaCy anyway, and you can go turn it on in your tokenizer.)
Thanks for the digging! Really helpful in elaborating on the issue!
In this case, a log message gets printed for literally every token (note: not every type) in my dataset, which pollutes stderr and stdout. I also save both streams to files, and both of these files are now 1 GB+ simply because of this one warning (when they should be in the hundreds of KBs).
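As a stopgap for the log size, a per-message filter might help. The sketch below assumes the message goes through Python's `warnings` module (if it comes through `logging` instead, a `logging.Filter` would be needed). The warning text here is a stand-in, so check the exact wording in your own logs before copying the pattern.

```python
import warnings

# Assumed wording: adjust this regex to match the message in your logs.
NOISY_PATTERN = r".*lemmatizer did not find POS annotation"


def emit_for_tokens(tokens):
    """Stand-in for the library code: one warning per token, each with
    different text, so Python's default once-per-location dedup does not help."""
    for tok in tokens:
        warnings.warn(
            f"The rule-based lemmatizer did not find POS annotation for the token '{tok}'.",
            UserWarning,
        )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # baseline: record every warning
    # Added after the baseline, so this filter is consulted first.
    warnings.filterwarnings("ignore", message=NOISY_PATTERN)
    emit_for_tokens(["The", "cat", "sat"])
    warnings.warn("something unrelated")  # other warnings still get through

print([str(w.message) for w in caught])
```

This only hides the symptom: the lemmas are still just lowercased tokens, so it is a workaround for the log files rather than a fix.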
I thought the warning is generated because the POS tagger is not on? I was also thinking that always keeping the tagger on doesn't add much computation.
@leo-liuzy will make a PR that turns on tagging by default. |
I'm training a model (https://github.com/allenai/allennlp-models/blob/main/training_config/pair_classification/mnli_roberta.jsonnet) with allennlp 2.1.0, using SpaCy 3.
There are a bunch of warnings that show up (I'm not sure if these were here before, but I noticed them now because my log files are now massive):
I think this is because of allennlp.common.util.get_spacy_model (allennlp/allennlp/common/util.py, line 258 in 96415b2).
Not sure what y'all think is the best way to solve this... Could we add a lemmatization argument to get_spacy_model that defaults to false? This is a change in the defaults from previous versions, though.
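A rough sketch of what that argument could look like. This is not the actual AllenNLP code: `pipes_to_disable` and `load_spacy_model` are hypothetical names, the base disable list is an assumption, and the only spaCy call used is `spacy.load(name, disable=[...])`.

```python
from typing import List


def pipes_to_disable(
    pos_tags: bool, parse: bool, ner: bool, lemmatize: bool = False
) -> List[str]:
    """Build the list of pipeline components to switch off (hypothetical helper)."""
    disable = ["vectors", "textcat"]  # assumed baseline, not taken from the issue
    if not pos_tags:
        disable.append("tagger")
    if not parse:
        disable.append("parser")
    if not ner:
        disable.append("ner")
    # The spaCy 3 lemmatizer needs POS tags, so drop it unless it was
    # requested AND the tagger is on.
    if not lemmatize or not pos_tags:
        disable.append("lemmatizer")
    return disable


def load_spacy_model(
    name: str, pos_tags: bool, parse: bool, ner: bool, lemmatize: bool = False
):
    """Load a spaCy pipeline with the unwanted components disabled."""
    import spacy  # imported lazily so the helper above works without spaCy installed

    return spacy.load(name, disable=pipes_to_disable(pos_tags, parse, ner, lemmatize))


print(pipes_to_disable(pos_tags=False, parse=False, ner=False))
```

With `lemmatize=False` as the default, the per-token warning goes away because the lemmatizer never runs. Anyone who wants lemmas would have to pass both `pos_tags=True` and `lemmatize=True`, which is the change in defaults mentioned above.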