Node Parsing char_type

Appendix D: Node Parsing and `char_type`

When parsing nodes, each `Natto::MeCabNode` has a `char_type` value that classifies the node's leading character as one of the following:

0 - DEFAULT
1 - SPACE
2 - KANJI
3 - SYMBOL
4 - NUMERIC
5 - ALPHA
6 - HIRAGANA
7 - KATAKANA
8 - KANJINUMERIC
9 - GREEK
10 - CYRILLIC

An example:

# -F    ... short-form for --node-format
# %m    ... surface
# %f[0] ... part-of-speech (first element of IPADIC ChaSen feature)
# %t    ... char_type
# enclosing the options in single-quotes preserves the \t and \n
#
nm = Natto::MeCab.new('-F %m\t%f[0]\t%t\n')
puts nm.parse('こんにちは。吾輩の名は「ブルザエモン」である。')

こんにちは      感動詞  6
。      記号    3
吾輩    名詞    2
の      助詞    6
名      名詞    2
は      助詞    6
「      記号    3
ブルザエモン    名詞    7
」      記号    3
で      助動詞  6
ある    助動詞  6
。      記号    3
EOS
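
To make output like the above easier to read, the numeric `char_type` can be translated back to its category name. The following is a minimal sketch, not part of natto: `CHAR_TYPE_NAMES` and `label_char_type` are hypothetical names introduced for illustration, and the sketch assumes each line is tab-separated exactly as specified by the node format `'%m\t%f[0]\t%t\n'`. It works on plain strings, so it runs without MeCab installed.

```ruby
# Category names for the char_type values listed above, indexed by value.
CHAR_TYPE_NAMES = %w[
  DEFAULT SPACE KANJI SYMBOL NUMERIC ALPHA
  HIRAGANA KATAKANA KANJINUMERIC GREEK CYRILLIC
].freeze

# Hypothetical helper: splits one "surface\tpos\tchar_type" line and looks
# up the category name. Returns nil for lines without three fields (e.g. EOS).
def label_char_type(line)
  surface, pos, type = line.chomp.split("\t")
  return nil if type.nil?

  name = CHAR_TYPE_NAMES.fetch(Integer(type, 10), 'UNKNOWN')
  { surface: surface, pos: pos, char_type: name }
end

# Two lines taken from the example output, followed by the EOS marker.
sample = "吾輩\t名詞\t2\nブルザエモン\t名詞\t7\nEOS\n"
sample.each_line do |line|
  row = label_char_type(line)
  puts "#{row[:surface]}: #{row[:char_type]}" if row
end
# prints:
#   吾輩: KANJI
#   ブルザエモン: KATAKANA
```

In practice the string passed to `each_line` would be the return value of `nm.parse(...)`. Values outside the table fall back to `'UNKNOWN'`, since a dictionary with a different `char.def` may define other categories.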

The `char_type` values were found empirically to correspond to the zero-based order in which the character categories are defined at the top of `char.def`. No other MeCab documentation for `char_type` (文字種定義, "character type definition") could be found. Cf. this blog page (in Japanese).
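
The index-order observation can be sketched in code. This is an illustration only, under stated assumptions: that category definition lines in `char.def` begin with the category name, that code point mapping lines begin with `0x`, and that `#` starts a comment. `category_indexes` is a hypothetical helper, and `CHAR_DEF_SAMPLE` is an abridged stand-in whose flag columns are placeholders rather than values copied from a real dictionary.

```ruby
# Abridged, illustrative text in the style of a char.def file.
# The numeric flag columns are placeholders, not real dictionary values.
CHAR_DEF_SAMPLE = <<~DEF
  # category definitions
  DEFAULT   0 1 0
  SPACE     0 1 0
  KANJI     0 0 2

  # code point mappings (skipped by the helper below)
  0x0020 SPACE
  0x4E00..0x9FFF KANJI
DEF

# Hypothetical helper: returns { "NAME" => index } in definition order.
def category_indexes(text)
  names = text.each_line.map do |line|
    body = line.sub(/#.*/, '').strip # drop comments and surrounding whitespace
    next if body.empty? || body.start_with?('0x')

    body.split.first                 # first token is the category name
  end
  names.compact.each_with_index.to_h
end

indexes = category_indexes(CHAR_DEF_SAMPLE)
puts indexes['KANJI'] # prints 2, matching the char_type listed for KANJI
```

Running the same helper over the `char.def` of the dictionary actually in use is one way to check whether its category order matches the table above.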

