
cannot analyze ̄ ̄ with japanese models #5961

Closed
TatsuyaShirakawa opened this issue Aug 24, 2020 · 11 comments · Fixed by #5969
Labels
bug (Bugs and behaviour differing from documentation) · lang / ja (Japanese language data and models)

Comments

@TatsuyaShirakawa

How to reproduce the behaviour

When I tried the following very small script

import spacy
nlp = spacy.load('ja_core_news_sm')
nlp(' ̄ ̄')

I got the following error

>>> nlp(' ̄ ̄')
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 106, in get_dtokens_and_spaces
    word_start = text[text_pos:].index(word)
ValueError: substring not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/spacy/language.py", line 441, in __call__
    doc = self.make_doc(text)
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 281, in make_doc
    return self.tokenizer(text)
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 145, in __call__
    dtokens, spaces = get_dtokens_and_spaces(dtokens, text)
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 108, in get_dtokens_and_spaces
    raise ValueError(Errors.E194.format(text=text, words=words))
ValueError: [E194] Unable to aligned mismatched text ' ̄ ̄' and words '[' ', '̄', ' ̄']'.

The minimal Dockerfile is here

FROM python:3.7

RUN pip install spacy
RUN python -m spacy download ja_core_news_sm

Your Environment

  • Operating System: Linux 04a7a76544e5 4.19.76-linuxkit #1 SMP Thu Oct 17 19:31:58 UTC 2019 x86_64 GNU/Linux
  • Python Version Used: 3.7.7
  • spaCy Version Used: 2.3.2
  • Environment Information: Minimal Dockerfile is as below
FROM python:3.7

RUN pip install spacy
RUN python -m spacy download ja_core_news_sm
@svlandeg added the bug (Bugs and behaviour differing from documentation) and lang / ja (Japanese language data and models) labels Aug 24, 2020
@svlandeg
Member

svlandeg commented Aug 24, 2020

Thanks for the report! This is definitely a bug.

@hiroshi-matsuda-rit: I don't know whether you'd have time to look into this? I don't speak Japanese, so I'm not sure about the tokenization issues. From a first inspection, it looks to me like the self._get_dtokens function includes the space within the third token, but get_dtokens_and_spaces then skips over that space, which ultimately results in an error because the third token can no longer be found in the string. I feel like self._get_dtokens is probably the function that should be fixed?
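To illustrate the failure mode, here is a minimal hypothetical sketch (not spaCy's actual implementation) of an aligner that skips separator whitespace before searching for each token. A token whose surface itself begins with a space then becomes unfindable, because the space has already been consumed ('x' stands in for the combining macron):

```python
def align_tokens(text, words):
    """Hypothetical sketch of aligning tokenizer output back to the text,
    in the spirit of get_dtokens_and_spaces (NOT spaCy's real code)."""
    text_pos = 0
    for word in words:
        if word.isspace():
            # pure-whitespace tokens are consumed directly
            text_pos += len(word)
            continue
        # skip separator whitespace before searching for the token
        while text_pos < len(text) and text[text_pos] == " ":
            text_pos += 1
        # raises ValueError("substring not found") if the token cannot
        # be located in the remaining text
        start = text[text_pos:].index(word)
        text_pos += start + len(word)

# A well-behaved tokenization aligns fine:
align_tokens(" a b", [" ", "a", "b"])

# But a third token that itself starts with a space cannot be found,
# because the aligner already skipped past that space:
try:
    align_tokens(" x x", [" ", "x", " x"])
except ValueError:
    print("alignment failed: substring not found")
```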

@hiroshi-matsuda-rit
Contributor

This behavior might be coming from SudachiPy.
I'd like to research it soon.
@polm Have you encountered this kind of problem?

@hiroshi-matsuda-rit
Contributor

@sorami Could you help us?

@polm
Contributor

polm commented Aug 24, 2020

Looks like it's a macron character? Wouldn't be used in normal Japanese, but might be used in romaji.

https://www.fileformat.info/info/unicode/char/0304/index.htm
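The character from the issue title can be inspected with Python's standard unicodedata module, which confirms it is a nonspacing combining mark:

```python
import unicodedata

ch = "\u0304"  # the character from the issue title
print(unicodedata.name(ch))      # COMBINING MACRON
print(unicodedata.category(ch))  # Mn (nonspacing combining mark)
```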

I suspect this has to do with how SudachiPy normalizes characters; this was a vaguely similar issue:

WorksApplications/SudachiPy#120

@adrianeboyd
Contributor

The sudachipy analysis didn't look obviously incorrect to me, either. I suspect the problem is that the third token returned by sudachipy starts with whitespace and that throws the alignment off like Sofie described. But I'm also not sure enough about how sudachipy should work to be sure where the bug is.

@hiroshi-matsuda-rit
Contributor

@adrianeboyd The third token in SudachiPy's output for the example sentence starts with whitespace, which is unexpected behavior for the current Japanese language model.
In such cases, we should split the token into the whitespace and the remaining part.
I'll make a quick fix for the master branch.
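The proposed split could look something like this hypothetical sketch (names and details are illustrative, not the actual patch): any token surface that begins with whitespace is divided into a whitespace token plus the remainder.

```python
def split_leading_space(tokens):
    """Hypothetical sketch of the proposed fix: split any token whose
    surface starts with spaces into a whitespace token plus the rest."""
    out = []
    for tok in tokens:
        stripped = tok.lstrip(" ")
        n_spaces = len(tok) - len(stripped)
        if n_spaces:
            out.append(" " * n_spaces)
        if stripped:
            out.append(stripped)
    return out

# The problematic third token " x" becomes two tokens, so the aligner
# can match every surface against the text again:
print(split_leading_space([" ", "x", " x"]))  # -> [' ', 'x', ' ', 'x']
```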

@hiroshi-matsuda-rit
Contributor

After trying some workarounds, I decided to set the space_after field of each token by referring to the surface of the next token instead of the next character in the text.
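A rough sketch of that idea (hypothetical, not the merged patch): a token gets space_after=True when the next token's surface is whitespace, and whitespace tokens themselves are folded away rather than compared against the raw text.

```python
def attach_spaces(words):
    """Hypothetical sketch: derive each token's space_after flag from
    the surface of the NEXT token (is it whitespace?) rather than
    peeking at the next character of the raw text."""
    tokens, spaces = [], []
    for i, w in enumerate(words):
        if w.isspace():
            continue  # folded into the previous token's space_after
        tokens.append(w)
        spaces.append(i + 1 < len(words) and words[i + 1].isspace())
    return tokens, spaces

print(attach_spaces(["x", " ", "x"]))  # -> (['x', 'x'], [True, False])
```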

@svlandeg linked a pull request Aug 25, 2020 that will close this issue
@hiroshi-matsuda-rit
Contributor

@sorami It seems SudachiPy has some inconsistencies in the dictionary_form and reading_form fields when analyzing contexts that include certain symbol characters after whitespace.

@svlandeg @adrianeboyd I think we can release a bug-fix version even if SudachiPy is not fixed.

@jpavlick

Hello, I realize this topic is closed but I recently ran into a similar problem when attempting to read text containing the character ́. I was wondering if this is just malformed data on my part, or if the bugfix described in this issue should take care of it? And if the latter, is the bugfix already released? I didn't notice anything in 2.3.1. Thanks for your help!

@polm
Contributor

polm commented Nov 13, 2020

My impression is that spaCy should not throw an exception on any text you throw at it. However, that means that it will process even garbage.

It looks like you have a COMBINING ACUTE ACCENT floating by itself, which is not really going to be useful. You might be able to fix it by using Unicode NFKC normalization on your input text.
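For a combining mark that does have a base character, NFKC normalization composes the pair into a single precomposed code point, which sidesteps the alignment problem; a lone floating mark, however, is left unchanged. A small illustration:

```python
import unicodedata

text = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT (two code points)
norm = unicodedata.normalize("NFKC", text)
print(norm, len(text), len(norm))  # é 2 1  (composed into U+00E9)

# A combining mark with no base character is NOT removed by NFKC:
lone = "\u0301"
print(unicodedata.normalize("NFKC", lone) == lone)  # True
```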

@github-actions
Contributor

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions bot locked as resolved and limited conversation to collaborators Oct 30, 2021