Extremely long single-word tokenization in French #1078
Comments
It also happens with repeated f, i, and s characters. And I wouldn't call it a performance issue: it is definitely a bug, because the tokenizer is essentially unusable on input containing those repeated characters.
Played around with it a bit and found that the URL pattern is the culprit: https://github.com/explosion/spaCy/blob/master/spacy/fr/tokenizer_exceptions.py#L211. Removing it fixes the hang. I'm not sure why it's there in the first place: do we really need to match URL patterns for tokens?
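For context on why a URL regex hangs on runs of a single character: when an inner quantifier and an outer one can both absorb the same characters, a failing match forces the engine to try every way of splitting the run, which is exponential in its length. A minimal demonstration with an illustrative pattern (not the exact one from tokenizer_exceptions.py):

```python
import re
import time

# Illustrative URL-ish pattern: the inner \w+ and the outer (?:...)+ can
# both consume the same run of word characters, so a failing match tries
# every possible split of the run -- catastrophic backtracking.
bad_url = re.compile(r"^(?:\w+\.?)+\w+/?$")

for n in (16, 20, 24):
    text = "i" * n + "!"  # the trailing "!" guarantees the match fails
    start = time.time()
    bad_url.match(text)
    print(f"n={n}: {time.time() - start:.3f}s")  # roughly doubles per extra char
```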
This URL pattern has already caused hanging issues in the past: #957. I've dug further, and the matching hangs when we use the … So it seems there is a bug related to how the … @honnibal What do you think would be best?

edit: I agree with @Arnie0426, this is not only a performance issue but a blocker, as spaCy is unreliable when using the French pipeline.
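Until the pattern itself is fixed, one possible workaround is to disable the URL match entirely. A sketch, assuming the tokenizer's token_match attribute is writable (it is in later spaCy releases; the cost is that URLs are no longer kept as single tokens):

```python
import spacy

nlp = spacy.load("fr")

# The URL pattern is consulted through the tokenizer's token_match
# callable; setting it to None skips the pathological regex entirely.
nlp.tokenizer.token_match = None

doc = nlp("iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii")  # returns promptly now
print([t.text for t in doc])
```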
Resolve issue #1078 by simplifying URL pattern
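The PR's actual diff isn't reproduced here; as a sketch of what "simplifying" buys, a pattern without ambiguous nested repetition fails fast on the same pathological input (illustrative, not the actual spaCy pattern):

```python
import re

# Requiring the dot inside the repeated group leaves the engine only one
# way to carve up a run of word characters, so a failed match costs
# linear time instead of exponential.
safe_url = re.compile(r"^\w+(?:\.\w+)+/?$")

print(safe_url.match("i" * 40 + "!"))  # None, returned instantly
print(safe_url.match("example.com/"))  # matches
```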
Closing issue (see PR #1411)
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Original report: minimal example to reproduce:
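A sketch of an equivalent reproduction (the 'fr' model shortcut is an assumption based on the spaCy version of the time; the original snippet is not preserved here):

```python
import spacy

# Load the French pipeline ('fr' was the model shortcut in spaCy 1.x).
nlp = spacy.load('fr')

# A long run of a single repeated character is enough to trigger the hang.
doc = nlp('i' * 40)
print([t.text for t in doc])  # never reached, or only after a long wait
```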
The process hangs during tokenization. Depending on the number of 'i' characters, it sometimes finishes, but only after at least a minute.
If we stop the process, we get this stack trace:
This issue doesn't occur using the English model, so it seems it has to do with the tokenization exceptions.
Your Environment