
cannot analyze ̄ ̄ with japanese models #5961

Closed
TatsuyaShirakawa opened this issue Aug 24, 2020 · 11 comments · Fixed by #5969
Labels
bug (Bugs and behaviour differing from documentation) · lang / ja (Japanese language data and models)

Comments

@TatsuyaShirakawa

How to reproduce the behaviour

When I tried the following very small script

import spacy
nlp = spacy.load('ja_core_news_sm')
nlp(' ̄ ̄')

I got the following error

>>> nlp(' ̄ ̄')
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 106, in get_dtokens_and_spaces
    word_start = text[text_pos:].index(word)
ValueError: substring not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/spacy/language.py", line 441, in __call__
    doc = self.make_doc(text)
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 281, in make_doc
    return self.tokenizer(text)
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 145, in __call__
    dtokens, spaces = get_dtokens_and_spaces(dtokens, text)
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 108, in get_dtokens_and_spaces
    raise ValueError(Errors.E194.format(text=text, words=words))
ValueError: [E194] Unable to aligned mismatched text ' ̄ ̄' and words '[' ', '̄', ' ̄']'.

The minimal Dockerfile is here

FROM python:3.7

RUN pip install spacy
RUN python -m spacy download ja_core_news_sm

Your Environment

  • Operating System: Linux 04a7a76544e5 4.19.76-linuxkit #1 SMP Thu Oct 17 19:31:58 UTC 2019 x86_64 GNU/Linux
  • Python Version Used: 3.7.7
  • spaCy Version Used: 2.3.2
  • Environment Information: Minimal Dockerfile is as below
FROM python:3.7

RUN pip install spacy
RUN python -m spacy download ja_core_news_sm
@svlandeg added the bug (Bugs and behaviour differing from documentation) and lang / ja (Japanese language data and models) labels Aug 24, 2020
@svlandeg
Member

svlandeg commented Aug 24, 2020

Thanks for the report! This is definitely a bug.

@hiroshi-matsuda-rit: I don't know whether you'd have time to look into this? I don't speak Japanese, so I'm not sure about the tokenization issues. From a first inspection, it looks to me like the self._get_dtokens function includes the space within the third token, but get_dtokens_and_spaces then skips over that space, which ultimately results in an error because the third token can no longer be found in the string. I feel like self._get_dtokens is probably the function that should be fixed?
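To illustrate the failure mode, here is a minimal hypothetical sketch (not spaCy's actual implementation) of an aligner that skips separator whitespace before searching for each token. A token whose surface itself begins with a space then becomes unfindable, because the space has already been consumed ('x' stands in for the combining macron):

```python
def align_tokens(text, words):
    """Hypothetical sketch of aligning tokenizer output back to the text,
    in the spirit of get_dtokens_and_spaces (NOT spaCy's real code)."""
    text_pos = 0
    for word in words:
        if word.isspace():
            # pure-whitespace tokens are consumed directly
            text_pos += len(word)
            continue
        # skip separator whitespace before searching for the token
        while text_pos < len(text) and text[text_pos] == " ":
            text_pos += 1
        # raises ValueError("substring not found") if the token cannot
        # be located in the remaining text
        start = text[text_pos:].index(word)
        text_pos += start + len(word)

# A well-behaved tokenization aligns fine:
align_tokens(" a b", [" ", "a", "b"])

# But a third token that itself starts with a space cannot be found,
# because the aligner already skipped past that space:
try:
    align_tokens(" x x", [" ", "x", " x"])
except ValueError:
    print("alignment failed: substring not found")
```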

@hiroshi-matsuda-rit
Contributor

This behavior might be coming from SudachiPy.
I'd like to research it soon.
@polm Have you encountered this kind of problem?

@hiroshi-matsuda-rit
Contributor

@sorami Could you help us?

@polm
Contributor

polm commented Aug 24, 2020

Looks like it's a macron character? Wouldn't be used in normal Japanese, but might be used in romaji.

https://www.fileformat.info/info/unicode/char/0304/index.htm
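The character from the issue title can be inspected with Python's standard unicodedata module, which confirms it is a nonspacing combining mark:

```python
import unicodedata

ch = "\u0304"  # the character from the issue title
print(unicodedata.name(ch))      # COMBINING MACRON
print(unicodedata.category(ch))  # Mn (nonspacing combining mark)
```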

I suspect this has to do with how SudachiPy normalizes characters; this was a vaguely similar issue:

WorksApplications/SudachiPy#120

@adrianeboyd
Contributor

The sudachipy analysis didn't look obviously incorrect to me, either. I suspect the problem is that the third token returned by sudachipy starts with whitespace and that throws the alignment off like Sofie described. But I'm also not sure enough about how sudachipy should work to be sure where the bug is.

@hiroshi-matsuda-rit
Contributor

@adrianeboyd The third token in SudachiPy's output for the example sentence starts with whitespace, which is unexpected behavior for the current Japanese language model.
In such cases, we should split the token into the whitespace and the remaining part.
I'll make a quick fix for the master branch.
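The proposed split could look something like this hypothetical sketch (names and details are illustrative, not the actual patch): any token surface that begins with whitespace is divided into a whitespace token plus the remainder.

```python
def split_leading_space(tokens):
    """Hypothetical sketch of the proposed fix: split any token whose
    surface starts with spaces into a whitespace token plus the rest."""
    out = []
    for tok in tokens:
        stripped = tok.lstrip(" ")
        n_spaces = len(tok) - len(stripped)
        if n_spaces:
            out.append(" " * n_spaces)
        if stripped:
            out.append(stripped)
    return out

# The problematic third token " x" becomes two tokens, so the aligner
# can match every surface against the text again:
print(split_leading_space([" ", "x", " x"]))  # -> [' ', 'x', ' ', 'x']
```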

@hiroshi-matsuda-rit
Contributor

After trying some workarounds, I decided to set the space_after field of each token by referring to the surface of the next token instead of the next character in the text.
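A rough sketch of that idea (hypothetical, not the merged patch): a token gets space_after=True when the next token's surface is whitespace, and whitespace tokens themselves are folded away rather than compared against the raw text.

```python
def attach_spaces(words):
    """Hypothetical sketch: derive each token's space_after flag from
    the surface of the NEXT token (is it whitespace?) rather than
    peeking at the next character of the raw text."""
    tokens, spaces = [], []
    for i, w in enumerate(words):
        if w.isspace():
            continue  # folded into the previous token's space_after
        tokens.append(w)
        spaces.append(i + 1 < len(words) and words[i + 1].isspace())
    return tokens, spaces

print(attach_spaces(["x", " ", "x"]))  # -> (['x', 'x'], [True, False])
```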

@svlandeg linked a pull request Aug 25, 2020 that will close this issue
@hiroshi-matsuda-rit
Contributor

@sorami It seems SudachiPy has some inconsistencies in the dictionary_form and reading_form fields when analyzing contexts that include certain symbol characters after whitespace.

@svlandeg @adrianeboyd I think we can release a bug-fix version even if SudachiPy is not fixed.

@jpavlick

Hello, I realize this topic is closed but I recently ran into a similar problem when attempting to read text containing the character ́. I was wondering if this is just malformed data on my part, or if the bugfix described in this issue should take care of it? And if the latter, is the bugfix already released? I didn't notice anything in 2.3.1. Thanks for your help!

@polm
Contributor

polm commented Nov 13, 2020

My impression is that spaCy should not throw an exception on any text you throw at it. However, that means that it will process even garbage.

It looks like you have a COMBINING ACUTE ACCENT floating by itself, which is not really going to be useful. You might be able to fix it by using Unicode NFKC normalization on your input text.
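For a combining mark that does have a base character, NFKC normalization composes the pair into a single precomposed code point, which sidesteps the alignment problem; a lone floating mark, however, is left unchanged. A small illustration:

```python
import unicodedata

text = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT (two code points)
norm = unicodedata.normalize("NFKC", text)
print(norm, len(text), len(norm))  # é 2 1  (composed into U+00E9)

# A combining mark with no base character is NOT removed by NFKC:
lone = "\u0301"
print(unicodedata.normalize("NFKC", lone) == lone)  # True
```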

@github-actions
Contributor

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions bot locked as resolved and limited conversation to collaborators Oct 30, 2021