
Normalized form of "am" is "a.m." #2754

Closed
patperry opened this issue Sep 11, 2018 · 7 comments
Labels: bug (Bugs and behaviour differing from documentation) · feat / tokenizer (Feature: Tokenizer) · perf / accuracy (Performance: accuracy)

Comments

@patperry

It appears that for the default English model, the normalized form of "am" is "a.m.". This seems reasonable, but then shouldn't the normalized form of "I'm" be ["i", "a.m."]?

Would it be better to make the normalized form of "am" and "a.m." be "am"?

How to reproduce the behaviour

from spacy.lang.en import English
nlp = English()
print([tok.norm_ for tok in nlp("I'm fine. I am fine")])
#> ['i', 'am', 'fine', '.', 'i', 'a.m.', 'fine']
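
For comparison, this is what I would expect if "am" simply normalized to itself (hypothetical output, assuming the norm falls back to the lowercase form once the exception is removed):

from spacy.lang.en import English
nlp = English()
print([tok.norm_ for tok in nlp("I'm fine. I am fine")])
#> expected: ['i', 'am', 'fine', '.', 'i', 'am', 'fine']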

Your Environment

  • Operating System: Linux version 4.9.93-linuxkit-aufs
  • Python Version Used: 3.7.0
  • spaCy Version Used: 2.0.12
  • Environment Information:
@honnibal
Member

I think this is probably a bad norm exception, because it really refers to "am" as in the time of day. The more common case is the verb "to be", so we should remove this.
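
A quick read-only way to surface any other norm exceptions of this kind (just a sketch; `sample_texts` is a placeholder for whatever text you want to audit):

from spacy.lang.en import English

nlp = English()
# Placeholder corpus; swap in any text you want to audit.
sample_texts = ["I'm fine. I am fine", "See you at 3am or 10p.m."]
for text in sample_texts:
    for tok in nlp(text):
        # Flag tokens whose norm is not simply their lowercased text.
        if tok.norm_ != tok.lower_:
            print(repr(tok.text), "->", repr(tok.norm_))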

@honnibal honnibal added the perf / accuracy Performance: accuracy label Sep 12, 2018
@ines ines added the feat / tokenizer Feature: Tokenizer label Sep 12, 2018
@gavrieltal
Contributor

gavrieltal commented Oct 1, 2018

Is this supposed to be the relevant code, in tokenizer_exceptions.py, around line 290?

for h in range(1, 12 + 1):
    for period in ["a.m.", "am"]:
        _exc["%d%s" % (h, period)] = [
            {ORTH: "%d" % h},
            {ORTH: period, LEMMA: "a.m.", NORM: "a.m."}]
    for period in ["p.m.", "pm"]:
        _exc["%d%s" % (h, period)] = [
            {ORTH: "%d" % h},
            {ORTH: period, LEMMA: "p.m.", NORM: "p.m."}]

And, if so, why would one get that the norm of token "am" in "I am fine" is "a.m." when there was no number beforehand such as in "3am"?
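
For reference, per the table above these exceptions should only ever fire for fused tokens, along the lines of this sketch (expected output inferred from the table, not verified):

from spacy.lang.en import English

nlp = English()
# "3am" matches a special-case rule from the table above and is split in two.
print([(tok.text, tok.norm_) for tok in nlp("3am")])
#> expected: [('3', '3'), ('am', 'a.m.')]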

@ines
Member

ines commented Oct 1, 2018

> And, if so, why would one get that the norm of token "am" in "I am fine" is "a.m." when there was no number beforehand such as in "3am"?

You're correct, this doesn't make sense, and the above exceptions really only target fused tokens like "1am" or "10p.m.".

I've been trying to find where this norm is set and it's pretty mysterious... 🤔 I haven't been able to figure it out yet.

@patperry
Author

patperry commented Oct 4, 2018

Also this is troubling:

from spacy.lang.en import English
nlp = English()
print([tok.norm_ for tok in nlp('The normalized form of "a" is "gonna".')])
#> ['the', 'normalized', 'form', 'of', '"', 'gonna', '"', 'is', '"', 'going', 'to', '"', '.']

@ines
Member

ines commented Oct 5, 2018

@patperry

> print([tok.norm_ for tok in nlp('The normalized form of "a" is "gonna".')])

Thanks for the example – this might explain a lot. It looks like this is actually an issue with the caching of the token attributes. There was a similar bug a while ago that I'm pretty sure we fixed, but it might still be related.
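
To illustrate the failure mode (a self-contained toy, not spaCy's actual cache code): when cached analyses are keyed too coarsely, attributes computed for one string can be replayed onto a different one.

# Toy illustration of attribute leakage through a badly keyed cache.
# This is NOT spaCy's implementation; it only shows the failure pattern:
# if the cache key does not uniquely identify the input, one input's
# cached attributes get replayed for a different input.
norm_exceptions = {"gonna": "going to"}
_cache = {}

def buggy_norm(token):
    key = len(token)  # BUG: length is not a unique key for a token
    if key not in _cache:
        _cache[key] = norm_exceptions.get(token, token.lower())
    return _cache[key]

print(buggy_norm("gonna"))  # -> 'going to' (correct, now cached under key 5)
print(buggy_norm("three"))  # -> 'going to' (wrong: collides with 'gonna')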

@ines ines added the bug Bugs and behaviour differing from documentation label Oct 5, 2018
honnibal added a commit that referenced this issue Dec 8, 2018
@honnibal
Member

honnibal commented Dec 8, 2018

Fix should now be up on develop: #3029

Thanks for your help reporting this.

@honnibal honnibal closed this as completed Dec 8, 2018
@lock lock bot locked as resolved and limited conversation to collaborators Jan 7, 2019