Mention extraction issue with external sentence boundaries #1929

joelb-git · 2018-02-03T13:24:53Z

We're trying to make spacy respect sentence boundaries provided
externally using this method:

We found that this changes the mentions that are extracted. We then tried
passing each sentence individually, using this trick to prevent spaCy from
adding more sentence breaks:

#1032 (comment)

But this also seems to affect mention extraction in a similar way.

The test below runs a single sentence through spaCy, showing the
behavior before and after setting is_sent_start.

$ cat foo.py
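import spacy


nlp = spacy.load('en')

# Baseline: default pipeline, no sentence boundaries set in advance.
doc = nlp('Bill Clinton was president.')
print([t.is_sent_start for t in doc])
print(list(doc.ents))


def manual_sentence_segmentation(doc):
    # Mark only the first token as a sentence start, so the parser
    # does not add further sentence breaks.
    for i, token in enumerate(doc):
        token.is_sent_start = i == 0
    return doc


# Run the custom component before the parser, then repeat the test.
nlp.add_pipe(manual_sentence_segmentation, name='manual-sbd', before='parser')
doc = nlp('Bill Clinton was president.')
print([t.is_sent_start for t in doc])
print(list(doc.ents))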

Output:

$ python foo.py
[None, None, None, None, None]
[Bill Clinton]
[True, False, False, False, False]
[Clinton]  <-- expected this to be [Bill Clinton]

Info about spaCy

spaCy version: 2.0.4
Platform: Darwin-16.7.0-x86_64-i386-64bit
Python version: 3.6.0
Models: en

honnibal · 2018-02-08T11:35:21Z

This seems to be a bug related to the is_sent_start value on the first word of the document. Try not setting the first token --- it seems to work without it?

joelb-git · 2018-02-08T12:48:27Z

Setting is_sent_start to False on the first token (and others) results in the same behavior:

[None, None, None, None, None]
[Bill Clinton]
[False, False, False, False, False]
[Clinton]

honnibal · 2018-02-08T14:21:36Z

Hmm I made a mistake in the example I tried. I'm not sure what's going on here -- seems like there might be a bug in the NER transition system.

Edit: Yes, definitely. The sent_start attribute was changed to take ternary values, and there's code in the NER transition system that's checking it as a boolean. This means the -1 value is evaluating to True. Thanks for the report --- will have this fixed in the next version.
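
To illustrate the kind of mistake described above, here is a minimal sketch in plain Python. It does not use spaCy's actual internals: the constant names, the 0/1/-1 encoding, and both helper functions are assumptions made up for this example. It only shows why a ternary value checked as a boolean treats -1 ("not a sentence start") the same as 1 ("sentence start").

```python
# Hypothetical encoding of the three states; the names and helpers below
# are assumptions for illustration, not spaCy's real code.
SENT_START_UNKNOWN = 0   # nothing set yet
SENT_START_TRUE = 1      # token starts a sentence
SENT_START_FALSE = -1    # token explicitly does not start a sentence


def starts_sentence_buggy(sent_start):
    # Boolean check: any non-zero value, including -1, counts as True.
    return bool(sent_start)


def starts_sentence_fixed(sent_start):
    # Explicit comparison: only the "true" state counts as a sentence start.
    return sent_start == SENT_START_TRUE


for value in (SENT_START_UNKNOWN, SENT_START_TRUE, SENT_START_FALSE):
    print(value, starts_sentence_buggy(value), starts_sentence_fixed(value))
```

With the boolean check, the -1 case is reported as a sentence start, which would be consistent with a token marked False being handled as if a boundary preceded it.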

lock · 2018-05-07T23:55:37Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

honnibal added the "bug" label (Bugs and behaviour differing from documentation) Feb 8, 2018

honnibal closed this as completed in e361b4f Feb 8, 2018

lock bot locked as resolved and limited conversation to collaborators May 7, 2018
