Customized tokenizer reports 'AssertionError' with specific infixes #1292

Closed

Vimos opened this issue Aug 30, 2017 · 3 comments
Comments

@Vimos
Contributor

Vimos commented Aug 30, 2017

When trying to add '—-' to the infix_re (the whole regex is '[\[\]!&:,()\*—\-]') for a custom tokenizer, I got the following error:

Traceback (most recent call last):
  File "/home/vimos/Public/github/NLP/keras-nli/examples/squad/sequence_pointer/preprocessing.py", line 15, in <module>
    o.preprocessing()
  File "/home/vimos/Public/github/NLP/keras-nli/kerasnli/preprocessing/squad.py", line 223, in preprocessing
    self.to_sequence()
  File "/home/vimos/Public/github/NLP/keras-nli/kerasnli/preprocessing/squad.py", line 119, in to_sequence
    context = [word.text for word in nlp(p['context'])]
  File "/data/home/vimos/Public/git/github/NLP/spaCy/spacy/language.py", line 320, in __call__
    doc = self.make_doc(text)
  File "/home/vimos/Public/github/NLP/keras-nli/kerasnli/preprocessing/squad.py", line 36, in <lambda>
    return lambda text: tokenizer(text)
  File "tokenizer.pyx", line 152, in spacy.tokenizer.Tokenizer.__call__
  File "tokenizer.pyx", line 217, in spacy.tokenizer.Tokenizer._tokenize
  File "tokenizer.pyx", line 315, in spacy.tokenizer.Tokenizer._attach_tokens
  File "doc.pyx", line 469, in spacy.tokens.doc.Doc.push_back
AssertionError

I am using the head of spaCy, but the same error occurs on an earlier version. Only '—-' causes this problem.

After some hacking, I found a suspect for the error: a trailing '-' in the string. The test case below reproduces it.

# coding: utf-8
from __future__ import unicode_literals

from ... import load
from ...tokenizer import Tokenizer
from ... import util


def test_customized_tokenizer_handles_infixes():
    def custom_tokenizer(nlp_model):
        # Reuse the default prefix/suffix rules, but install a custom infix set.
        prefix_re = util.compile_prefix_regex(nlp_model.Defaults.prefixes)
        suffix_re = util.compile_suffix_regex(nlp_model.Defaults.suffixes)
        custom_infixes = [r'\.\.\.+',
                          r'(?<=[0-9])-(?=[0-9])',
                          r'[0-9]+(,[0-9]+)+',
                          r'[\[\]!&:,()\*—–/-]']
        infix_re = util.compile_infix_regex(custom_infixes)

        tokenizer = Tokenizer(nlp_model.vocab,
                              nlp_model.Defaults.tokenizer_exceptions,
                              prefix_re.search,
                              suffix_re.search,
                              infix_re.finditer,
                              token_match=None)
        return lambda text: tokenizer(text)

    nlp = load('en', create_make_doc=custom_tokenizer)

    # No trailing '-' after a number: tokenizes fine.
    sentence = "The 8 and 10-county definitions are not used for the greater Southern California Megaregion."
    context = [word.text for word in nlp(sentence)]
    assert context == ['The', '8', 'and', '10', '-', 'county', 'definitions',
                       'are', 'not', 'used', 'for', 'the', 'greater',
                       'Southern', 'California', 'Megaregion', '.']

    # The trailing '-' in '8-' triggers the AssertionError.
    sentence = "The 8- and 10-county definitions are not used for the greater Southern California Megaregion."
    context = [word.text for word in nlp(sentence)]
    assert context == ['The', '8', '-', 'and', '10', '-', 'county', 'definitions',
                       'are', 'not', 'used', 'for', 'the', 'greater',
                       'Southern', 'California', 'Megaregion', '.']
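
For illustration, here is a minimal sketch using plain re (mirroring the slicing logic of the infix loop, not the actual spaCy internals) of why the trailing '-' is problematic: the remainder slice after the last infix match is empty.

# coding: utf-8
import re

# The infix regex from the report above.
infix_re = re.compile(u'[\[\]!&:,()\*—\-]')

string = u'8-'  # a token ending in an infix character
start = 0
spans = []
for match in infix_re.finditer(string):
    spans.append(string[start:match.start()])        # text before the infix
    spans.append(string[match.start():match.end()])  # the infix itself
    start = match.end()
spans.append(string[start:])  # remainder; empty when the infix is last

print(spans)  # ['8', '-', '']

Presumably Doc.push_back asserts against accepting a zero-length lexeme, so that trailing empty span is what trips the AssertionError.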

Your Environment

  • Operating System: Ubuntu 16.04
  • Python Version Used: Python 2.7
  • spaCy Version Used: master head
  • Environment Information:
@Vimos
Contributor Author

Vimos commented Aug 30, 2017

I have found a solution for this bug and would like to push a pull request:

diff --git a/spacy/tokenizer.pyx b/spacy/tokenizer.pyx
index 276f0ef..799e4bd 100644
--- a/spacy/tokenizer.pyx
+++ b/spacy/tokenizer.pyx
@@ -312,7 +312,8 @@ cdef class Tokenizer:
 
                         start = infix_end
                     span = string[start:]
-                    tokens.push_back(self.vocab.get(tokens.mem, span), False)
+                    if span:
+                        tokens.push_back(self.vocab.get(tokens.mem, span), False)
         cdef vector[const LexemeC*].reverse_iterator it = suffixes.rbegin()
         while it != suffixes.rend():
             lexeme = deref(it)
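
With this guard, the tokenizer simply skips pushing the empty remainder span left behind by a trailing infix match (the '-' in '8-' above), which should make the second sentence in the test case tokenize as expected.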

@honnibal
Member

Thanks, this looks good!

honnibal added a commit that referenced this issue Sep 4, 2017
Fix issue #1292 and add test case for the Assertion Error
Vimos closed this as completed Sep 6, 2017
@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 8, 2018