Customized tokenizer reports 'AssertionError' with specific infixes #1292

Closed

Vimos opened this issue Aug 30, 2017 · 3 comments
Comments

@Vimos
Contributor

Vimos commented Aug 30, 2017

When trying to add '—-' to the infix_re (the whole regex is '[\[\]!&:,()\*—\-]') for a custom tokenizer, I got the following error:

Traceback (most recent call last):
  File "/home/vimos/Public/github/NLP/keras-nli/examples/squad/sequence_pointer/preprocessing.py", line 15, in <module>
    o.preprocessing()
  File "/home/vimos/Public/github/NLP/keras-nli/kerasnli/preprocessing/squad.py", line 223, in preprocessing
    self.to_sequence()
  File "/home/vimos/Public/github/NLP/keras-nli/kerasnli/preprocessing/squad.py", line 119, in to_sequence
    context = [word.text for word in nlp(p['context'])]
  File "/data/home/vimos/Public/git/github/NLP/spaCy/spacy/language.py", line 320, in __call__
    doc = self.make_doc(text)
  File "/home/vimos/Public/github/NLP/keras-nli/kerasnli/preprocessing/squad.py", line 36, in <lambda>
    return lambda text: tokenizer(text)
  File "tokenizer.pyx", line 152, in spacy.tokenizer.Tokenizer.__call__
  File "tokenizer.pyx", line 217, in spacy.tokenizer.Tokenizer._tokenize
  File "tokenizer.pyx", line 315, in spacy.tokenizer.Tokenizer._attach_tokens
  File "doc.pyx", line 469, in spacy.tokens.doc.Doc.push_back
AssertionError

I am using the head of spaCy, but the same error occurs on an earlier version. Only '—-' causes this problem.

After some hacking, I found a suspect for the error: a trailing '-' in the string. The test case below reproduces it.

# coding: utf-8
from __future__ import unicode_literals

from ... import load
from ...tokenizer import Tokenizer
from ... import util


def test_customized_tokenizer_handles_infixes():
    def custom_tokenizer(nlp_model):
        # Reuse the default prefix/suffix rules, but install a custom infix set.
        prefix_re = util.compile_prefix_regex(nlp_model.Defaults.prefixes)
        suffix_re = util.compile_suffix_regex(nlp_model.Defaults.suffixes)
        custom_infixes = [r'\.\.\.+',
                          r'(?<=[0-9])-(?=[0-9])',
                          r'[0-9]+(,[0-9]+)+',
                          r'[\[\]!&:,()\*—–/-]']
        infix_re = util.compile_infix_regex(custom_infixes)

        tokenizer = Tokenizer(nlp_model.vocab,
                              nlp_model.Defaults.tokenizer_exceptions,
                              prefix_re.search,
                              suffix_re.search,
                              infix_re.finditer,
                              token_match=None)
        return lambda text: tokenizer(text)

    nlp = load('en', create_make_doc=custom_tokenizer)

    # No trailing '-' after a number: tokenizes fine.
    sentence = "The 8 and 10-county definitions are not used for the greater Southern California Megaregion."
    context = [word.text for word in nlp(sentence)]
    assert context == ['The', '8', 'and', '10', '-', 'county', 'definitions',
                       'are', 'not', 'used', 'for', 'the', 'greater',
                       'Southern', 'California', 'Megaregion', '.']

    # The trailing '-' in '8-' triggers the AssertionError.
    sentence = "The 8- and 10-county definitions are not used for the greater Southern California Megaregion."
    context = [word.text for word in nlp(sentence)]
    assert context == ['The', '8', '-', 'and', '10', '-', 'county', 'definitions',
                       'are', 'not', 'used', 'for', 'the', 'greater',
                       'Southern', 'California', 'Megaregion', '.']
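
For illustration, here is a minimal sketch using plain re (mirroring the slicing logic of the infix loop, not the actual spaCy internals) of why the trailing '-' is problematic: the remainder slice after the last infix match is empty.

# coding: utf-8
import re

# The infix regex from the report above.
infix_re = re.compile(u'[\[\]!&:,()\*—\-]')

string = u'8-'  # a token ending in an infix character
start = 0
spans = []
for match in infix_re.finditer(string):
    spans.append(string[start:match.start()])        # text before the infix
    spans.append(string[match.start():match.end()])  # the infix itself
    start = match.end()
spans.append(string[start:])  # remainder; empty when the infix is last

print(spans)  # ['8', '-', '']

Presumably Doc.push_back asserts against accepting a zero-length lexeme, so that trailing empty span is what trips the AssertionError.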

Your Environment

  • Operating System: Ubuntu 16.04
  • Python Version Used: Python 2.7
  • spaCy Version Used: master head
  • Environment Information:
@Vimos
Contributor Author

Vimos commented Aug 30, 2017

I have found a solution for this bug and would like to push a pull request:

diff --git a/spacy/tokenizer.pyx b/spacy/tokenizer.pyx
index 276f0ef..799e4bd 100644
--- a/spacy/tokenizer.pyx
+++ b/spacy/tokenizer.pyx
@@ -312,7 +312,8 @@ cdef class Tokenizer:
 
                         start = infix_end
                     span = string[start:]
-                    tokens.push_back(self.vocab.get(tokens.mem, span), False)
+                    if span:
+                        tokens.push_back(self.vocab.get(tokens.mem, span), False)
         cdef vector[const LexemeC*].reverse_iterator it = suffixes.rbegin()
         while it != suffixes.rend():
             lexeme = deref(it)
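
With this guard, the tokenizer simply skips pushing the empty remainder span left behind by a trailing infix match (the '-' in '8-' above), which should make the second sentence in the test case tokenize as expected.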

@honnibal
Member

Thanks, this looks good!

honnibal added a commit that referenced this issue Sep 4, 2017
Fix issue #1292 and add test case for the Assertion Error
Vimos closed this as completed Sep 6, 2017
@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 8, 2018