This repository has been archived by the owner on Mar 9, 2023. It is now read-only.
spaCy model accuracy significantly degraded from SudachiPy v0.4.6 #129
Comments
This problem was reproduced on macOS 10.14.6, Windows 10 version 1909, and WSL with Python 3.8.
polm added a commit to polm/SudachiPy that referenced this issue on Jun 18, 2020
The connection costs lookup was backwards. There was a comment in the pre-cython code that the call to the cost lookup function looked backwards, but was actually correct. It was calling with (l_node.right_id, r_node.left_id). I kept this order when I replaced the function call with a memoryview access, but that was wrong; I hadn't noticed that the access function was actually reversing the order of its arguments. This fix was verified by checking the tokenization of a short document vs v0.4.5 and making sure there were no changes.
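To make the failure mode concrete, here is a minimal sketch of how an accessor that swaps its arguments can hide the real index order of a matrix. The names `MATRIX`, `get_cost`, and `Node` are hypothetical stand-ins rather than SudachiPy's actual code, and the matrix values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    left_id: int
    right_id: int


# Toy asymmetric cost matrix (hypothetical values). Because it is square,
# swapped indices stay in range and silently return the wrong cost.
MATRIX = [
    [0, 5, 9],
    [2, 0, 7],
    [4, 1, 0],
]


def get_cost(a: int, b: int) -> int:
    """Hypothetical accessor that reverses its arguments before indexing."""
    return MATRIX[b][a]


def cost_via_accessor(l_node: Node, r_node: Node) -> int:
    # Looks backwards at the call site, but is correct because of the swap.
    return get_cost(l_node.right_id, r_node.left_id)


def cost_direct_buggy(l_node: Node, r_node: Node) -> int:
    # Keeps the call-site order when inlining the lookup: wrong.
    return MATRIX[l_node.right_id][r_node.left_id]


def cost_direct_fixed(l_node: Node, r_node: Node) -> int:
    # Applies the accessor's swap explicitly: matches the original behavior.
    return MATRIX[r_node.left_id][l_node.right_id]


l_node = Node(left_id=0, right_id=1)
r_node = Node(left_id=2, right_id=0)

print(cost_via_accessor(l_node, r_node))  # 1
print(cost_direct_buggy(l_node, r_node))  # 7
print(cost_direct_fixed(l_node, r_node))  # 1
```

No exception is raised in the buggy variant, which is why a bug of this kind tends to show up only as degraded accuracy downstream.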
Thanks for the report. I found the cause of this: I screwed up in the Cythonization, and the connection costs were wrong. I just opened a PR with a fix.
I strongly recommend adding the spaCy evaluation step to the CI tests.
The decline in the TOK measure should be within 0.1%.
I've merged @polm's fix and released v0.4.8. Sorry for the degradation. Yes, we should include the spaCy evaluation step in the CI (#132), or at least test with some paragraphs.
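As a rough sketch of the "test with some paragraphs" idea, a CI check could compare the current tokenization of a few fixed texts against stored expected output. Everything here is an assumption for illustration: `find_regressions`, `GOLDEN`, and the whitespace tokenizer are hypothetical, and in a real test the `tokenize` argument would be a wrapper around the SudachiPy tokenizer that returns surface forms.

```python
from typing import Callable, Dict, List

Tokenize = Callable[[str], List[str]]


def find_regressions(
    tokenize: Tokenize, golden: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Return {text: actual_tokens} for each text whose tokens differ from golden."""
    diffs: Dict[str, List[str]] = {}
    for text, expected in golden.items():
        actual = tokenize(text)
        if actual != expected:
            diffs[text] = actual
    return diffs


def whitespace_tokenize(text: str) -> List[str]:
    # Stand-in tokenizer so the sketch runs without any dictionary installed.
    return text.split()


# Hypothetical golden data; a real suite would store output from a known-good release.
GOLDEN = {
    "the quick brown fox": ["the", "quick", "brown", "fox"],
    "connection costs matter": ["connection", "costs", "matter"],
}

regressions = find_regressions(whitespace_tokenize, GOLDEN)
print(regressions)  # {}
```

An exact-match check like this is stricter than a 0.1% tolerance on an accuracy metric, but it is cheap to run and would flag any change in segmentation for the covered texts.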
I just tested v0.4.8 and got the same result. Thank you for the quick response!
@sorami @polm Could you investigate the cause of this difference between v0.4.5 and v0.4.6?