German: Quotation marks not correctly tokenized #596
Comments
Thanks! A work-around until this is fixed:
import spacy.de
import spacy
spacy.de.German.Defaults.prefixes += tuple([u'"'])
nlp = spacy.load('de')
I haven't tested this yet, but it should work. Basically, the source data is in |
Thank you for your quick reply! Adding the quotation mark to the prefixes solved my problem. However, I still believe there is something wrong with the TOKENIZER_PREFIXES, since the quotation mark is already on this list, so it should work by default. Greetings! |
Hmm, I assumed it was a different quotation mark that just looked similar. Is it really the same character? If so then yes, something's wrong. |
It is. I also tested it with the octothorpe that is definitely on the list and "#Ich" is also tokenized as one token. |
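For readers following along, the behaviour being discussed (a leading character such as " or # is split off when it appears on a prefix list) can be sketched in plain Python. This is only an illustration under assumptions: the `PREFIXES` list and the helpers `build_prefix_regex` and `split_prefixes` are hypothetical names made up here, and this is not spaCy's actual tokenizer code or its real TOKENIZER_PREFIXES data.

```python
import re

# Hypothetical prefix list for illustration only; not spaCy's TOKENIZER_PREFIXES.
PREFIXES = [u'"', u'#', u'(', u'[', u'{']


def build_prefix_regex(prefixes):
    """Build one regex that matches any listed prefix at the start of a string."""
    # re.escape keeps characters such as ( and [ from being read as regex syntax.
    return re.compile(u'|'.join(u'^' + re.escape(p) for p in prefixes))


def split_prefixes(chunk, prefix_re):
    """Peel leading prefix characters off a whitespace-delimited chunk."""
    tokens = []
    match = prefix_re.search(chunk)
    while chunk and match:
        tokens.append(match.group())   # the prefix becomes its own token
        chunk = chunk[match.end():]    # continue with the remainder
        match = prefix_re.search(chunk)
    if chunk:
        tokens.append(chunk)           # whatever is left after all prefixes
    return tokens


prefix_re = build_prefix_regex(PREFIXES)
print(split_prefixes(u'"Ich', prefix_re))
print(split_prefixes(u'#Ich', prefix_re))
```

If a character is on the list but the chunk still comes back as a single token, the list itself is probably fine and the problem lies in how the regex is built or applied, which is the suspicion raised in this thread.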
Hm! Think I see the problem. |
…up the tokenizer regex for non-ascii characters in Python 2. Re Issue #596
Argh, Python 2 and its encoding issues... Thanks for fixing it so quickly! |
I run into very similar issues with parentheses (, [, and {
gives
while if I add
Looking at the default |
@schlichtanders Thanks for the report and your analysis – this makes a lot of sense. I'll add a test and take care of it! |
thanks for the immediate reaction. |
@schlichtanders Hmm, so I haven't managed to reproduce this error. Tested it with both the German model and just the tokenizer, and it's tokenized correctly:
Which version of spaCy are you using? Btw, this doesn't actually seem like a problem in your case, but just so you know: |
I am very sorry. Thanks for coming back to me so quickly. |
No worries – glad to hear it's working! 👍 |
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
In some cases, a quotation mark is not separated from the following token.
Example
Your Environment