This repository has been archived by the owner on Mar 9, 2023. It is now read-only.

JoinKatakana plugin behaves differently from the Java version #162

Closed
kazuma-t opened this issue Sep 6, 2021 · 1 comment

kazuma-t commented Sep 6, 2021

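In SudachiPy, the JoinKatakana plugin always creates an OOV node when it concatenates nodes in concatenate_oov(). The Java version instead uses Lattice#getMinimumNode(), which returns the lowest-cost node when the lattice already contains nodes covering the same range. For the input オバケ in the dumps in this report, the Java version returns the dictionary entry (word ID 816334, normalized form お化け), while SudachiPy returns an OOV node (word ID 0, normalized form オバケ).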
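
The difference in behavior can be sketched roughly as follows. This is a minimal illustration, not the actual Sudachi or SudachiPy code: the `Node` class and the functions `get_minimum_node`, `concatenate_always_oov`, and `concatenate_prefer_lattice` are hypothetical names made up for this example. It also assumes that the last number on each rewriting-dump line is the node cost, and that an OOV node is created only when no existing node covers the range.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    # Hypothetical stand-in for a lattice node; not the real Sudachi class.
    begin: int          # start offset, as printed in the dumps
    end: int            # end offset, as printed in the dumps
    surface: str
    word_id: int        # 0 marks an OOV node in this sketch
    cost: int
    is_oov: bool = False


def get_minimum_node(nodes: List[Node], begin: int, end: int) -> Optional[Node]:
    """Return the lowest-cost node covering exactly [begin, end), or None."""
    candidates = [n for n in nodes if n.begin == begin and n.end == end]
    return min(candidates, key=lambda n: n.cost, default=None)


def concatenate_always_oov(path: List[Node]) -> Node:
    """Behavior reported for SudachiPy: always build a new OOV node."""
    surface = "".join(n.surface for n in path)
    return Node(path[0].begin, path[-1].end, surface, 0, 0, is_oov=True)


def concatenate_prefer_lattice(lattice_nodes: List[Node], path: List[Node]) -> Node:
    """Behavior described for the Java version: reuse an existing node if any.

    Falling back to an OOV node when nothing matches is an assumption here.
    """
    existing = get_minimum_node(lattice_nodes, path[0].begin, path[-1].end)
    return existing if existing is not None else concatenate_always_oov(path)


# Values taken from the dumps in this report.
path = [Node(0, 3, "オ", 185851, 5621), Node(3, 9, "バケ", 233719, 3446)]
lattice_nodes = path + [Node(0, 9, "オバケ", 816334, 10000)]

oov_result = concatenate_always_oov(path)
lattice_result = concatenate_prefer_lattice(lattice_nodes, path)
print(oov_result.word_id, lattice_result.word_id)
```

With the values from the dumps, the first function yields a node with word ID 0, matching the SudachiPy output, and the second yields the existing node with word ID 816334, matching the Java output.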

Sudachi (Java version)

=== Input dump:
オバケ
=== Lattice dump:
0: 9 9 (null)(0) BOS/EOS 0 0 0: 50 50 -739 -286 -944 211 -250 -163 -205 -852 -852 50 -739 -286 -944 211 -250 -852 -852 -955 50 -739 -286 -944 211 -250
1: 0 9 オバケ(816334) 名詞,普通名詞,一般,*,*,* 5139 5139 10000: 893
...
51: 0 3 オ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: -640
52: 0 0 (null)(0) BOS/EOS 0 0 0: 0
=== Before rewriting:
0: 0 3 オ(185851) 67 5946 5946 5621
1: 3 9 バケ(233719) 3 5142 5142 3446
=== After rewriting:
0: 0 9 オバケ(816334) 3 5139 5139 10000
===
オバケ  名詞,普通名詞,一般,*,*,*        お化け
EOS

SudachiPy

=== Inupt dump:
オバケ
=== Lattice dump:
1: 9 9 (null)(0) BOS/EOS 0 0 0: 50 50 -739 -286 -944 211 -250 -163 -205 -852 -852 50 -739 -286 -944 211 -250 -852 -852 -955 50 -739 -286 -944 211 -250
2: 0 9 オバケ(816309) 名詞,普通名詞,一般,*,*,* 5139 5139 10000: 893
...
41: 0 0 (null)(0) BOS/EOS 0 0 0: 0
=== Before Rewriting:
0: 0 3 オ(185851) 5946 5946 5621
1: 3 9 バケ(233719) 5142 5142 3446
=== After Rewriting:
0: 0 9 オバケ(0) 0 0 0
===
オバケ  名詞,普通名詞,一般,*,*,*        オバケ
EOS
kazuma-t added a commit that referenced this issue Sep 10, 2021

kazuma-t commented

Fixed in #163
