
Normalized form of "am" is "a.m." #2754

Closed
patperry opened this issue Sep 11, 2018 · 7 comments
Labels: bug (Bugs and behaviour differing from documentation) · feat / tokenizer (Feature: Tokenizer) · perf / accuracy (Performance: accuracy)

Comments

@patperry

It appears that for the default English model, the normalized form of "am" is "a.m.". This seems reasonable, but then shouldn't the normalized form of "I'm" be ["i", "a.m."]?

Would it be better to make the normalized form of "am" and "a.m." be "am"?

How to reproduce the behaviour

from spacy.lang.en import English
nlp = English()
print([tok.norm_ for tok in nlp("I'm fine. I am fine")])
#> ['i', 'am', 'fine', '.', 'i', 'a.m.', 'fine']
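
For comparison, this is what I would expect if "am" simply normalized to itself (hypothetical output, assuming the norm falls back to the lowercase form once the exception is removed):

from spacy.lang.en import English
nlp = English()
print([tok.norm_ for tok in nlp("I'm fine. I am fine")])
#> expected: ['i', 'am', 'fine', '.', 'i', 'am', 'fine']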

Your Environment

  • Operating System: Linux version 4.9.93-linuxkit-aufs
  • Python Version Used: 3.7.0
  • spaCy Version Used: 2.0.12
  • Environment Information:
@honnibal
Member

I think this is probably a bad norm exception, because it really refers to "am" as in the time of day. The more common case is the verb "to be", so we should remove this.
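
A quick read-only way to surface any other norm exceptions of this kind (just a sketch; `sample_texts` is a placeholder for whatever text you want to audit):

from spacy.lang.en import English

nlp = English()
# Placeholder corpus; swap in any text you want to audit.
sample_texts = ["I'm fine. I am fine", "See you at 3am or 10p.m."]
for text in sample_texts:
    for tok in nlp(text):
        # Flag tokens whose norm is not simply their lowercased text.
        if tok.norm_ != tok.lower_:
            print(repr(tok.text), "->", repr(tok.norm_))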

@honnibal honnibal added the perf / accuracy Performance: accuracy label Sep 12, 2018
@ines ines added the feat / tokenizer Feature: Tokenizer label Sep 12, 2018
@gavrieltal
Contributor

gavrieltal commented Oct 1, 2018

Is this supposed to be the relevant code, in tokenizer_exceptions.py, around line 290?

for h in range(1, 12 + 1):
    for period in ["a.m.", "am"]:
        _exc["%d%s" % (h, period)] = [
            {ORTH: "%d" % h},
            {ORTH: period, LEMMA: "a.m.", NORM: "a.m."}]
    for period in ["p.m.", "pm"]:
        _exc["%d%s" % (h, period)] = [
            {ORTH: "%d" % h},
            {ORTH: period, LEMMA: "p.m.", NORM: "p.m."}]

And, if so, why would one get that the norm of token "am" in "I am fine" is "a.m." when there was no number beforehand such as in "3am"?
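
For reference, per the table above these exceptions should only ever fire for fused tokens, along the lines of this sketch (expected output inferred from the table, not verified):

from spacy.lang.en import English

nlp = English()
# "3am" matches a special-case rule from the table above and is split in two.
print([(tok.text, tok.norm_) for tok in nlp("3am")])
#> expected: [('3', '3'), ('am', 'a.m.')]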

@ines
Member

ines commented Oct 1, 2018

> And, if so, why would one get that the norm of token "am" in "I am fine" is "a.m." when there was no number beforehand such as in "3am"?

You're correct, this doesn't make sense, and the above exceptions really only target fused tokens like "1am" or "10p.m.".

I've been trying to find where this norm is set and it's pretty mysterious... 🤔 I haven't been able to figure it out yet.

@patperry
Author

patperry commented Oct 4, 2018

Also this is troubling:

from spacy.lang.en import English
nlp = English()
print([tok.norm_ for tok in nlp('The normalized form of "a" is "gonna".')])
#> ['the', 'normalized', 'form', 'of', '"', 'gonna', '"', 'is', '"', 'going', 'to', '"', '.']

@ines
Member

ines commented Oct 5, 2018

@patperry

> print([tok.norm_ for tok in nlp('The normalized form of "a" is "gonna".')])

Thanks for the example – this might explain a lot. It looks like this is actually an issue with the caching of the token attributes. There was a similar bug a while ago that I'm pretty sure we fixed, but it might still be related.
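
To illustrate the failure mode (a self-contained toy, not spaCy's actual cache code): when cached analyses are keyed too coarsely, attributes computed for one string can be replayed onto a different one.

# Toy illustration of attribute leakage through a badly keyed cache.
# This is NOT spaCy's implementation; it only shows the failure pattern:
# if the cache key does not uniquely identify the input, one input's
# cached attributes get replayed for a different input.
norm_exceptions = {"gonna": "going to"}
_cache = {}

def buggy_norm(token):
    key = len(token)  # BUG: length is not a unique key for a token
    if key not in _cache:
        _cache[key] = norm_exceptions.get(token, token.lower())
    return _cache[key]

print(buggy_norm("gonna"))  # -> 'going to' (correct, now cached under key 5)
print(buggy_norm("three"))  # -> 'going to' (wrong: collides with 'gonna')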

@ines ines added the bug Bugs and behaviour differing from documentation label Oct 5, 2018
honnibal added a commit that referenced this issue Dec 8, 2018
@honnibal
Member

honnibal commented Dec 8, 2018

Fix should now be up on develop: #3029

Thanks for your help reporting this.

@honnibal honnibal closed this as completed Dec 8, 2018
@lock lock bot locked as resolved and limited conversation to collaborators Jan 7, 2019