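Mention extraction issue with external sentence boundaries
We're trying to make spaCy respect externally provided sentence
boundaries, using the method described in #1400.
We found that this affects mention extraction. We then tried
passing each sentence individually, using the trick from
#1032 (comment) to prevent spaCy from adding more sentence breaks.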
But this also seems to affect mention extraction in a similar way.
The test below runs a single sentence through spacy showing the
before/after behavior when changing is_sent_start.
$ cat foo.py
import spacy

nlp = spacy.load('en')

# Before: default pipeline, no manual sentence boundaries.
doc = nlp('Bill Clinton was president.')
print([t.is_sent_start for t in doc])
print([m for m in doc.ents])

def manual_sentence_segmentation(doc):
    # Mark only the first token as a sentence start.
    for i, token in enumerate(doc):
        token.is_sent_start = i == 0
    return doc

nlp.add_pipe(manual_sentence_segmentation, name='manual-sbd', before='parser')

# After: same text, with the custom component running before the parser.
doc = nlp('Bill Clinton was president.')
print([t.is_sent_start for t in doc])
print([m for m in doc.ents])
Output:
$ python foo.py
[None, None, None, None, None]
[Bill Clinton]
[True, False, False, False, False]
[Clinton] <-- expected this to be [Bill Clinton]
Info about spaCy
spaCy version: 2.0.4
Platform: Darwin-16.7.0-x86_64-i386-64bit
Python version: 3.6.0
Models: en
This seems to be a bug related to the is_sent_start value on the first word of the document. Try not setting the first token; it seems to work without it?
Hmm, I made a mistake in the example I tried. I'm not sure what's going on here. It seems like there might be a bug in the NER transition system.
Edit: Yes, definitely. The sent_start attribute was changed to take ternary values, and there's code in the NER transition system that still checks it as a boolean. This means the -1 value evaluates to True. Thanks for the report; this will be fixed in the next version.
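To illustrate the kind of bug described above, here is a minimal sketch in plain Python. It is not spaCy's actual NER code. The constants and function names are hypothetical, and the encoding (1 for a sentence start, -1 for not a sentence start, 0 for unknown) is assumed from the ternary description in the comment above. The point is only that a truthiness check treats -1 the same as 1.

```python
# Hypothetical encoding, assumed from the ternary values described above:
#  1 -> token starts a sentence, -1 -> token does not, 0 -> unknown.
SENT_START, NOT_SENT_START, UNKNOWN = 1, -1, 0


def starts_sentence_buggy(sent_start):
    # Truthiness check: every non-zero value passes, including -1.
    return bool(sent_start)


def starts_sentence_fixed(sent_start):
    # Explicit comparison: only the "starts a sentence" value passes.
    return sent_start == SENT_START


# Values a component like manual_sentence_segmentation would set
# for the first three tokens of a single-sentence document.
values = [SENT_START, NOT_SENT_START, NOT_SENT_START]
buggy = [starts_sentence_buggy(v) for v in values]
fixed = [starts_sentence_fixed(v) for v in values]
print(buggy)
print(fixed)
```

With the truthiness check, every token whose value was explicitly set looks like a sentence start, which would be consistent with an entity such as "Bill Clinton" being cut at a spurious boundary.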