Node Parsing char_type
Brooke M. Fujita edited this page Feb 9, 2015 · 2 revisions
When parsing nodes, each Natto::MeCabNode has a char_type value that classifies the leading character of the node's surface as one of the following:
- 0 - DEFAULT
- 1 - SPACE
- 2 - KANJI
- 3 - SYMBOL
- 4 - NUMERIC
- 5 - ALPHA
- 6 - HIRAGANA
- 7 - KATAKANA
- 8 - KANJINUMERIC
- 9 - GREEK
- 10 - CYRILLIC
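The list above can be kept as a small lookup table so that the numeric char_type is readable in output. This is only a sketch: the constant `CHAR_TYPE_NAMES` and the helper `char_type_name` are hypothetical names, not part of natto, and the order assumes the mapping listed above.

```ruby
# Category names in the order listed above; the array index is the
# char_type value. (Hypothetical constant, not provided by natto.)
CHAR_TYPE_NAMES = %w[
  DEFAULT SPACE KANJI SYMBOL NUMERIC ALPHA
  HIRAGANA KATAKANA KANJINUMERIC GREEK CYRILLIC
].freeze

# Hypothetical helper: returns the category name for a char_type value,
# or nil when the value is not a known index.
def char_type_name(char_type)
  return nil unless char_type.is_a?(Integer) && char_type >= 0
  CHAR_TYPE_NAMES[char_type]
end

puts char_type_name(6)
puts char_type_name(2)
```

When iterating over nodes with natto, the same helper could presumably be applied to each node's char_type, for example `char_type_name(n.char_type)`.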
An example
# -F ... short-form for --node-format
# %m ... surface
# %f[0] ... part-of-speech (first element of IPADIC ChaSen feature)
# %t ... char_type
# enclosing the options in single-quotes preserves the \t and \n
#
require 'natto'

nm = Natto::MeCab.new('-F %m\t%f[0]\t%t\n')
puts nm.parse('こんにちは。吾輩の名は「ブルザエモン」である。')
こんにちは 感動詞 6
。 記号 3
吾輩 名詞 2
の 助詞 6
名 名詞 2
は 助詞 6
「 記号 3
ブルザエモン 名詞 7
」 記号 3
で 助動詞 6
ある 助動詞 6
。 記号 3
EOS
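Output in this shape is easy to post-process. The sketch below works on a shortened sample string rather than calling MeCab, and assumes the three fields are tab-separated (as the node format requests) and that the output ends with an EOS line. The helper name `parse_rows` is hypothetical.

```ruby
# Shortened sample in the shape produced by the node format above:
# surface, part-of-speech, char_type, separated by tabs.
SAMPLE_OUTPUT = "こんにちは\t感動詞\t6\n。\t記号\t3\n吾輩\t名詞\t2\n" \
                "の\t助詞\t6\nブルザエモン\t名詞\t7\nEOS\n"

# Hypothetical helper: turns each output line into a hash, skipping
# blank lines and the trailing EOS marker.
def parse_rows(output)
  lines = output.each_line.map(&:chomp).reject { |l| l.empty? || l == 'EOS' }
  lines.map do |l|
    surface, pos, char_type = l.split("\t")
    { surface: surface, pos: pos, char_type: Integer(char_type) }
  end
end

rows = parse_rows(SAMPLE_OUTPUT)

# Keep only the surfaces whose leading character is KATAKANA (7).
katakana = rows.select { |r| r[:char_type] == 7 }.map { |r| r[:surface] }
puts katakana.inspect
```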
It was discovered empirically that the char_type values correspond to the index order of the character categories as defined at the top of char.def. No other MeCab documentation could be found for char_type (文字種定義, "character type definition"). Cf. this blog page (in Japanese).
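The index order can be checked against a dictionary's own char.def with a few lines of Ruby. This is a sketch under stated assumptions: the excerpt below is an abbreviated, illustrative sample modeled on an IPADIC-style char.def (the three numeric flags per category are assumed values), and the helper `category_order` is hypothetical. It treats any non-comment line of the form "NAME n n n" as a category definition and skips the code point mapping lines that start with 0x.

```ruby
# Illustrative excerpt only; a real char.def is much longer and its
# flag values may differ.
SAMPLE_CHAR_DEF = <<~TEXT
  DEFAULT        0 1 0  # DEFAULT is a mandatory category!
  SPACE          0 1 0
  KANJI          0 0 2
  SYMBOL         1 1 0
  NUMERIC        1 1 0
  ALPHA          1 1 0
  HIRAGANA       0 1 2
  KATAKANA       1 1 2
  KANJINUMERIC   1 1 0
  GREEK          1 1 0
  CYRILLIC       1 1 0

  # SPACE
  0x0020 SPACE
  0x3041..0x309F HIRAGANA
TEXT

# Hypothetical helper: returns category names in definition order.
def category_order(char_def_text)
  names = char_def_text.each_line.map do |line|
    body = line.sub(/#.*/, '').strip            # drop trailing comments
    next if body.empty? || body.start_with?('0x') # skip code point mappings
    fields = body.split
    next unless fields.size == 4
    next unless fields[1..-1].all? { |f| f.match?(/\A\d+\z/) }
    fields.first
  end
  names.compact
end

category_order(SAMPLE_CHAR_DEF).each_with_index do |name, index|
  puts "#{index} - #{name}"
end
```

To run it against an installed dictionary, replace the sample with the contents of that dictionary's char.def, read in the file's own encoding.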