Mention extraction issue with external sentence boundaries #1929

Closed
joelb-git opened this issue Feb 3, 2018 · 4 comments
Labels
bug Bugs and behaviour differing from documentation

Comments

@joelb-git

Mention extraction issue with external sentence boundaries

We're trying to make spaCy respect externally provided sentence
boundaries, using the method described in:

#1400

We found that this changes the mentions extracted (doc.ents). We then
tried passing each sentence individually, using this trick to prevent
spaCy from adding more sentence breaks:

#1032 (comment)

But this also seems to affect mention extraction in a similar way.

The test below runs a single sentence through spaCy, showing the
behavior before and after setting is_sent_start.

$ cat foo.py
import spacy


# Baseline: default pipeline, no manual sentence boundaries.
nlp = spacy.load('en')
doc = nlp('Bill Clinton was president.')
print([t.is_sent_start for t in doc])
print([m for m in doc.ents])

def manual_sentence_segmentation(doc):
    # Mark only the first token as a sentence start; all others are False.
    for i, token in enumerate(doc):
        token.is_sent_start = i == 0
    return doc

# Run the manual segmenter before the parser (spaCy 2.x add_pipe API).
nlp.add_pipe(manual_sentence_segmentation, name='manual-sbd', before='parser')
doc = nlp('Bill Clinton was president.')
print([t.is_sent_start for t in doc])
print([m for m in doc.ents])

Output:

$ python foo.py
[None, None, None, None, None]
[Bill Clinton]
[True, False, False, False, False]
[Clinton]  <-- expected this to be [Bill Clinton]

Info about spaCy

  • spaCy version: 2.0.4
  • Platform: Darwin-16.7.0-x86_64-i386-64bit
  • Python version: 3.6.0
  • Models: en
@honnibal
Member

honnibal commented Feb 8, 2018

This seems to be a bug related to the is_sent_start value on the first word of the document. Try not setting it on the first token; it seems to work without that?

honnibal added the "bug" label (Bugs and behaviour differing from documentation) on Feb 8, 2018
@joelb-git
Author

Setting is_sent_start to False on the first token (and others) results in the same behavior:

[None, None, None, None, None]
[Bill Clinton]
[False, False, False, False, False]
[Clinton]

@honnibal
Member

honnibal commented Feb 8, 2018

Hmm I made a mistake in the example I tried. I'm not sure what's going on here -- seems like there might be a bug in the NER transition system.

Edit: Yes, definitely. The sent_start attribute was changed to take ternary values, and there's code in the NER transition system that's checking it as a boolean. This means the -1 value is evaluating to True. Thanks for the report --- will have this fixed in the next version.
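
For readers unfamiliar with this kind of bug, here is a minimal pure-Python sketch of why a ternary flag breaks a boolean check. It is not spaCy's actual code: the constant names and the two helper functions are hypothetical, and the encoding (1 = sentence start, -1 = explicitly not a start, 0 = unset) is assumed from the comment above.

```python
# Assumed ternary encoding (names are hypothetical, for illustration only):
# 1 = starts a sentence, -1 = explicitly not a start, 0 = unset.
SENT_START, NOT_SENT_START, UNSET = 1, -1, 0


def is_start_buggy(sent_start):
    # Boolean test: any non-zero int is truthy, so -1 also counts as a start.
    return bool(sent_start)


def is_start_fixed(sent_start):
    # Explicit comparison: only the value 1 marks a sentence start.
    return sent_start == SENT_START


# Five tokens, with only the first one marked as a sentence start.
flags = [SENT_START] + [NOT_SENT_START] * 4
buggy = [is_start_buggy(f) for f in flags]
fixed = [is_start_fixed(f) for f in flags]
print(buggy)  # [True, True, True, True, True]
print(fixed)  # [True, False, False, False, False]
```

Under the boolean check every token looks like a sentence start, which would be consistent with an entity being unable to span "Bill Clinton", though the thread does not spell out that mechanism.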
