
Node Parsing char_type

Brooke M. Fujita edited this page Feb 9, 2015 · 2 revisions

Appendix D: Node Parsing and char_type

When parsing nodes, each Natto::MeCabNode has a char_type value that classifies the leading character of the node's surface as one of the following:

  • 0 - DEFAULT
  • 1 - SPACE
  • 2 - KANJI
  • 3 - SYMBOL
  • 4 - NUMERIC
  • 5 - ALPHA
  • 6 - HIRAGANA
  • 7 - KATAKANA
  • 8 - KANJINUMERIC
  • 9 - GREEK
  • 10 - CYRILLIC
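
For readable output, the list above can be kept as a small lookup table. This is only a sketch: the constant `CHAR_TYPE_NAMES` and the helper `char_type_name` are hypothetical names and are not part of natto or MeCab, and the ordering assumes the IPADIC char.def described on this page.

```ruby
# Assumed lookup table: index = char_type value, entry = category name.
# The order follows the list above; a different dictionary may differ.
CHAR_TYPE_NAMES = %w[
  DEFAULT SPACE KANJI SYMBOL NUMERIC ALPHA
  HIRAGANA KATAKANA KANJINUMERIC GREEK CYRILLIC
].freeze

# Hypothetical helper: returns the category name for a char_type value,
# or "UNKNOWN" when the value is not a non-negative index in the table.
def char_type_name(char_type)
  return 'UNKNOWN' unless char_type.is_a?(Integer) && char_type >= 0

  CHAR_TYPE_NAMES.fetch(char_type, 'UNKNOWN')
end

puts char_type_name(6)
puts char_type_name(7)
```

The explicit check for negative values is there because Ruby arrays accept negative indices, which would otherwise silently return a category counted from the end of the table.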

An example

# -F    ... short-form for --node-format
# %m    ... surface
# %f[0] ... part-of-speech (first element of IPADIC ChaSen feature)
# %t    ... char_type
# enclosing the options in single-quotes preserves the \t and \n
#
nm = Natto::MeCab.new('-F %m\t%f[0]\t%t\n')
puts nm.parse('こんにちは。吾輩の名は「ブルザエモン」である。')

こんにちは      感動詞  6
。      記号    3
吾輩    名詞    2
の      助詞    6
名      名詞    2
は      助詞    6
「      記号    3
ブルザエモン    名詞    7
」      記号    3
で      助動詞  6
ある    助動詞  6
。      記号    3
EOS
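
The third column of this output can be turned back into category names with plain string handling. The following is a rough sketch that works on the output text itself, so it does not need natto to run. It assumes the raw lines are tab-separated, as the `-F` format string specifies; `CHAR_TYPE_LABELS` and `label_line` are hypothetical names introduced here.

```ruby
# Assumed category names, indexed by char_type (IPADIC char.def order).
CHAR_TYPE_LABELS = %w[
  DEFAULT SPACE KANJI SYMBOL NUMERIC ALPHA
  HIRAGANA KATAKANA KANJINUMERIC GREEK CYRILLIC
].freeze

# Hypothetical helper: splits one "%m\t%f[0]\t%t" line into its fields and
# attaches the category name. Returns nil for EOS or malformed lines.
def label_line(line)
  surface, pos, char_type = line.chomp.split("\t")
  return nil if char_type.nil? || char_type !~ /\A\d+\z/

  index = char_type.to_i
  { surface: surface, pos: pos, char_type: index,
    name: CHAR_TYPE_LABELS.fetch(index, 'UNKNOWN') }
end

# A few lines taken from the output above, with tabs written explicitly.
output = "吾輩\t名詞\t2\nブルザエモン\t名詞\t7\n。\t記号\t3\nEOS\n"

output.each_line do |line|
  row = label_line(line)
  puts "#{row[:surface]} => #{row[:name]}" if row
end
```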

It was discovered empirically that the char_type values correspond to the index order of the character categories defined at the top of char.def. No other MeCab documentation could be found for char_type (文字種定義, "character type definition"). Cf. this blog page (in Japanese).
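
One way to check this observation against a given dictionary is to read the category definitions from char.def in order and number them from zero. The sketch below does that on an embedded sample rather than a real file. The sample only imitates the layout of the category section: the three numeric columns are illustrative and should not be taken as the values of any particular dictionary, and `category_order` is a hypothetical helper.

```ruby
# Sample text in the layout of a char.def category section (assumed).
# Only the category names and their order matter for this sketch.
SAMPLE_CHAR_DEF = <<~DEF
  # CATEGORY_NAME INVOKE GROUP LENGTH
  DEFAULT        0 1 0  # mandatory category
  SPACE          0 1 0
  KANJI          0 0 2
  SYMBOL         1 1 0
  NUMERIC        1 1 0
  ALPHA          1 1 0
  HIRAGANA       0 1 2
  KATAKANA       1 1 2
  KANJINUMERIC   1 1 0
  GREEK          1 1 0
  CYRILLIC       1 1 0

  # code point mappings follow; these lines are skipped by the parser
  0x0020 SPACE
  0x0030..0x0039 NUMERIC
DEF

# Hypothetical helper: returns the category names in definition order.
# A definition line is a name followed by two 0/1 flags and a length;
# comments, blank lines, and code point mappings do not match.
def category_order(char_def_text)
  char_def_text.each_line.map do |line|
    match = line.match(/\A([A-Z]+)\s+[01]\s+[01]\s+\d+/)
    match && match[1]
  end.compact
end

category_order(SAMPLE_CHAR_DEF).each_with_index do |name, index|
  puts "#{index} - #{name}"
end
```

If the printed order for a real char.def differs from the list at the top of this page, the char_type values reported for that dictionary would presumably differ as well.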


