This repository has been archived by the owner on Mar 9, 2023. It is now read-only.

spaCy model accuracy significantly degraded from SudachiPy v0.4.6 #129

Closed
hiroshi-matsuda-rit opened this issue Jun 17, 2020 · 6 comments

Comments

@hiroshi-matsuda-rit
Contributor

@sorami @polm Could you look into the cause of this difference between v0.4.5 and v0.4.6?

$ pip install -U sudachipy==0.4.7
$ python -m spacy evaluate ja_core_news_lg-2.3.1/ja_core_news_lg/ja_core_news_lg-2.3.1/ ja_gsd-ud-dev.ne.json

Time      1.24 s
Words     11887
Words/s   9596
TOK       91.53
POS       82.62
UAS       75.75
LAS       74.45
NER P     68.31
NER R     65.52
NER F     66.88
Textcat   0.00

$ pip install -U sudachipy==0.4.6
$ python -m spacy evaluate ja_core_news_lg-2.3.1/ja_core_news_lg/ja_core_news_lg-2.3.1/ ja_gsd-ud-dev.ne.json

Time      1.40 s
Words     11887
Words/s   8492
TOK       91.53
POS       82.62
UAS       75.75
LAS       74.45
NER P     68.31
NER R     65.52
NER F     66.88
Textcat   0.00

$ pip install -U sudachipy==0.4.5
$ python -m spacy evaluate ja_core_news_lg-2.3.1/ja_core_news_lg/ja_core_news_lg-2.3.1/ ja_gsd-ud-dev.ne.json

Time      1.35 s
Words     12121
Words/s   8990
TOK       97.67
POS       97.30
UAS       88.94
LAS       87.55
NER P     71.79
NER R     69.22
NER F     70.48
Textcat   0.00
@hiroshi-matsuda-rit
Contributor Author

This problem was reproduced on macOS 10.14.6, Windows 10 version 1909, and WSL with Python 3.8.

@sorami
Collaborator

sorami commented Jun 18, 2020

For v0.4.6 and v0.4.7 the major updates were only about Cythonization, so if the problem is within SudachiPy, I guess it has something to do with Cython.

Let us have a look.

polm added a commit to polm/SudachiPy that referenced this issue Jun 18, 2020
The connection costs lookup was backwards.

There was a comment in the pre-cython code that the call to the cost
lookup function looked backwards, but was actually correct. It was
calling with (l_node.right_id, r_node.left_id). I kept this order when I
replaced the function call with a memoryview access, but that was wrong;
I hadn't noticed that the access function was actually reversing the
order of its arguments.

This fix was verified by checking the tokenization of a short document
vs v0.4.5 and making sure there were no changes.
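
To make the mistake described above concrete, here is a minimal sketch in plain Python. The table, its values, and the function names are all hypothetical; they are not SudachiPy's real data or API. The sketch only shows how inlining an accessor that swaps its arguments, while keeping the call-site argument order, reads a different cell.

```python
# Hypothetical, deliberately non-symmetric cost table (not SudachiPy's data).
TABLE = [
    [0, 1, 2],
    [3, 4, 5],
    [6, 7, 8],
]


def get_cost(first, second):
    """Accessor that swaps its arguments before indexing the table."""
    return TABLE[second][first]


def inlined_wrong(first, second):
    """Direct table access that keeps the call-site order: reads the wrong cell."""
    return TABLE[first][second]


def inlined_fixed(first, second):
    """Direct table access with the swap reproduced: matches get_cost()."""
    return TABLE[second][first]


# The two inlined versions disagree whenever the table is not symmetric.
print(get_cost(2, 1), inlined_wrong(2, 1), inlined_fixed(2, 1))
```

Because the two orderings only diverge for non-symmetric entries, a check on a single pair of ids can miss the problem, which is why comparing whole tokenizations against v0.4.5 is a more reliable verification.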
@polm
Contributor

polm commented Jun 18, 2020

Thanks for the report. I found the cause: I screwed up in the Cythonization, and the connection costs were wrong. Just opened a PR with a fix.

@hiroshi-matsuda-rit
Contributor Author

I strongly recommend adding a spaCy evaluation step to the CI tests.
With the spaCy CLI and UD_Japanese-GSD v2.6-NE, you can run an evaluation like this:

# install the sudachipy module before running the steps below
$ pip install -U spacy sudachidict-core
$ python -m spacy download ja_core_news_md
$ python -m spacy evaluate ja_core_news_md ja_gsd-ud-test.ne.json

================================== Results ==================================

Time      1.29 s
Words     13053
Words/s   10131
TOK       98.11
POS       97.94
UAS       88.16
LAS       86.18
NER P     72.79
NER R     72.91
NER F     72.85
Textcat   0.00

The decline in the TOK score should be within 0.1%.
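
A CI step could enforce the TOK threshold above with something like the following sketch. It assumes the report text has the `NAME   value` layout shown in this thread and reads "within 0.1%" as 0.1 absolute points; the function names `parse_scores` and `tok_regressed` are hypothetical helpers, not part of spaCy.

```python
import re

# Matches rows such as "TOK       98.11", "NER P     72.79" or "Time      1.29 s".
ROW = re.compile(r"^\s*([A-Za-z][A-Za-z/ ]*?)\s{2,}([0-9]+(?:\.[0-9]+)?)(?:\s*s)?\s*$")


def parse_scores(report):
    """Collect metric name -> value from the text of an evaluation report."""
    scores = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if match:
            scores[match.group(1)] = float(match.group(2))
    return scores


def tok_regressed(baseline, current, tolerance=0.1):
    """Return True if TOK dropped by more than `tolerance` points (assumed absolute)."""
    return (baseline - current) > tolerance


# Excerpt in the same layout as the reports posted in this thread.
report = """\
================================== Results ==================================

Time      1.10 s
Words     12817
TOK       91.93
NER P     69.52
"""

scores = parse_scores(report)
if tok_regressed(baseline=98.11, current=scores["TOK"]):
    print("TOK regression detected:", scores["TOK"])
```

In CI the report text would come from capturing the output of `python -m spacy evaluate`, and a detected regression would fail the job instead of printing.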

@sorami
Collaborator

sorami commented Jun 18, 2020

@hiroshi-matsuda-rit

I've merged @polm's fix and released v0.4.8.

Sorry for the degradation. Yes, we should include the spaCy evaluation step in CI (#132), or at least test with some paragraphs.

v0.4.7

$ pip install -U sudachipy==0.4.7
$ python -m spacy evaluate ja_core_news_md ja_gsd-ud-test.ne.json

================================== Results ==================================

Time      1.10 s
Words     12817
Words/s   11630
TOK       91.93
POS       82.06
UAS       75.81
LAS       73.98
NER P     69.52
NER R     70.77
NER F     70.14
Textcat   0.00

v0.4.8

$ pip install -U sudachipy==0.4.8
$ python -m spacy evaluate ja_core_news_md ja_gsd-ud-test.ne.json

================================== Results ==================================

Time      1.20 s
Words     13053
Words/s   10871
TOK       98.11
POS       97.94
UAS       88.16
LAS       86.18
NER P     72.79
NER R     72.91
NER F     72.85
Textcat   0.00
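
The lighter option mentioned above, testing with some paragraphs, could look roughly like this sketch. The helper `find_tokenization_changes` and the golden data are hypothetical. The SudachiPy wrapper assumes `sudachipy` and a dictionary package are installed, and it is imported lazily so the comparison logic can be exercised without them.

```python
def find_tokenization_changes(tokenize, golden):
    """Return {text: actual_tokens} for every text whose tokens differ from golden."""
    changes = {}
    for text, expected in golden.items():
        actual = tokenize(text)
        if actual != expected:
            changes[text] = actual
    return changes


def sudachi_surfaces(text):
    """Tokenize with SudachiPy in split mode C and return the surface forms.

    Assumes sudachipy and a dictionary package are installed. The tokenizer
    is rebuilt on every call for brevity; a real test would create it once.
    """
    from sudachipy import dictionary, tokenizer

    tokenizer_obj = dictionary.Dictionary().create()
    mode = tokenizer.Tokenizer.SplitMode.C
    return [m.surface() for m in tokenizer_obj.tokenize(text, mode)]


# Stand-in tokenizer and placeholder golden data, used only to demonstrate
# the check. Real golden data would be recorded from a known-good release.
golden = {"a b c": ["a", "b", "c"], "d e": ["d", "e"]}
print(find_tokenization_changes(str.split, golden))
```

In a real test, `tokenize` would be `sudachi_surfaces` and the golden dictionary would hold a few Japanese paragraphs with the token lists produced by a release known to be correct, such as v0.4.5.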

@hiroshi-matsuda-rit
Contributor Author

hiroshi-matsuda-rit commented Jun 18, 2020

I just tested v0.4.8 and got the same result. Thank you for the quick response!
