Cannot analyze ̄ ̄ with Japanese models #5961
Comments
Thanks for the report! This is definitely a bug. @hiroshi-matsuda-rit: I don't know whether you'd have time to look into this? I don't speak Japanese, so I'm not sure about the tokenization issues. It looks to me, from a first inspection, that the
This behavior might be coming from SudachiPy.
@sorami Could you help us?
It looks like a macron character (U+0304). It wouldn't be used in normal Japanese, but it might be used in romaji. https://www.fileformat.info/info/unicode/char/0304/index.htm I suspect this has to do with how SudachiPy normalizes characters; this was a vaguely similar issue.
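To make the normalization suspicion concrete, here is a minimal sketch using only Python's standard `unicodedata` module. It assumes that NFKC-style normalization is involved somewhere in the pipeline, which this thread does not confirm, and the helper names `describe` and `nfkc` are made up for illustration. It shows that a fullwidth macron (U+FFE3) normalizes to a space followed by the combining macron (U+0304), which is one way a token could end up starting with whitespace.

```python
import unicodedata

COMBINING_MACRON = "\u0304"   # the character linked above
FULLWIDTH_MACRON = "\uffe3"   # assumption: a plausible original input character


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def nfkc(text):
    """Apply NFKC normalization (assumed here, not confirmed for SudachiPy)."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    print(describe(COMBINING_MACRON))
    # Each character produced by normalizing the fullwidth macron:
    print([describe(c) for c in nfkc(FULLWIDTH_MACRON)])
```

If the input in the report was a different character, the same two helpers can be used to check what it normalizes to.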
The
@adrianeboyd The third token in SudachiPy's output for the example sentence starts with whitespace, which is unexpected behavior for the current Japanese language model.
After trying some workarounds, I decided to set the space-after field of each token by referring to the surface of the next token instead of the next character in the text.
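The idea of deriving the space-after flag from the next token's surface can be sketched roughly as follows. This is an illustrative approximation in plain Python, not the actual spaCy patch: the function name `assign_space_after` and the list-of-strings input are assumptions, and a real implementation would work on SudachiPy morpheme objects.

```python
def assign_space_after(surfaces):
    """Pair each non-whitespace token with a space-after flag.

    The flag is taken from the surface of the *next* token (does it start
    with whitespace?) rather than from the next character of the raw text.
    Hypothetical helper for illustration only.
    """
    result = []
    for i, surface in enumerate(surfaces):
        word = surface.strip()
        if not word:
            # Whitespace-only token: it is represented by the previous
            # token's space-after flag, so it is not emitted itself.
            continue
        next_surface = surfaces[i + 1] if i + 1 < len(surfaces) else ""
        space_after = next_surface[:1].isspace() or surface[-1:].isspace()
        result.append((word, space_after))
    return result


if __name__ == "__main__":
    # A token whose surface starts with whitespace, as described above.
    print(assign_space_after(["a", " \u0304", "b"]))
```

Because the flag comes from token surfaces, it stays consistent even when the tokenizer's normalized surfaces no longer line up character by character with the original text. Note that this sketch drops whitespace that appears before the first token.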
@sorami It seems SudachiPy is inconsistent in its dictionary_form and reading_form fields when analyzing contexts that include specific symbol characters after whitespace. @svlandeg @adrianeboyd I think we can release a bug-fix version even if SudachiPy is not fixed.
Hello, I realize this topic is closed but I recently ran into a similar problem when attempting to read text containing the character
My impression is that spaCy should not throw an exception on any text you throw at it. However, that means that it will process even garbage. It looks like you have a
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
How to reproduce the behaviour
When I tried the following very small script
I got the following error
The minimal Dockerfile is here
Your Environment