Quick Start
Brooke M. Fujita edited this page Feb 10, 2015 · 1 revision
Here's a guide to getting started with MeCab parsing using natto.
This requires:
- Ruby 1.9 or greater
- an existing installation of MeCab with a system dictionary
- either:
  - automatic configuration: just make sure that mecab (and mecab-config if you are on Mac OS or *nix) are on your PATH
  - or explicit configuration: the MECAB_PATH environment variable set to the full path to the mecab library
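The two configuration styles listed above can be checked before loading natto. The following is a minimal sketch, not part of natto: `mecab_config_mode` is a hypothetical helper, and giving `MECAB_PATH` precedence over the `PATH` lookup is an assumption made for illustration. It looks only for an executable named `mecab`, so it does not cover `mecab-config` or Windows executable names.

```ruby
# Hypothetical helper; not part of natto. Reports which of the two
# configuration styles the given environment appears to satisfy.
def mecab_config_mode(env = ENV)
  # Explicit configuration: MECAB_PATH names the MeCab library.
  # (Checking this first is an assumption of this sketch.)
  lib = env['MECAB_PATH']
  return :explicit unless lib.nil? || lib.empty?

  # Automatic configuration: look for a mecab executable on PATH.
  dirs = env.fetch('PATH', '').split(File::PATH_SEPARATOR).reject(&:empty?)
  found = dirs.any? do |dir|
    candidate = File.join(dir, 'mecab')
    File.file?(candidate) && File.executable?(candidate)
  end
  found ? :automatic : :missing
end

puts mecab_config_mode
```

A result of `:missing` suggests that neither style is set up yet; it does not verify that the library itself loads.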
- First create an instance of a Natto::MeCab parser:

      require 'natto'

      nm = Natto::MeCab.new
      => #<Natto::MeCab:0x288f6d08
           @tagger=#<FFI::Pointer address=0x28d3ab80>,
           @libpath="/usr/local/lib/libmecab.so",
           @options={},
           @dicts=[#<Natto::DictionaryInfo:0x288f6ba0
                   @filepath="/usr/local/lib/mecab/dic/ipadic/sys.dic",
                   charset=utf-8,
                   type=0>],
           @version=0.996>
- Query the Natto::MeCab parser for its MeCab version and the absolute path to the MeCab library:

      puts nm.version
      => 0.996

      puts nm.libpath
      => /usr/local/lib/libmecab.so
- Fetch information about the dictionary used by the Natto::MeCab parser:

      puts nm.dicts.first.filepath
      => /usr/local/lib/mecab/dic/ipadic/sys.dic

      puts nm.dicts.first.charset
      => utf-8
- Use the parse method to tokenize a Japanese sentence, treating the result as a single string, and print the output to the screen:

      puts nm.parse('この星の一等賞になりたいの卓球で俺は、そんだけ!')
      この	連体詞,*,*,*,*,*,この,コノ,コノ
      星	名詞,一般,*,*,*,*,星,ホシ,ホシ
      の	助詞,連体化,*,*,*,*,の,ノ,ノ
      一等	名詞,一般,*,*,*,*,一等,イットウ,イットー
      賞	名詞,接尾,一般,*,*,*,賞,ショウ,ショー
      に	助詞,格助詞,一般,*,*,*,に,ニ,ニ
      なり	動詞,自立,*,*,五段・ラ行,連用形,なる,ナリ,ナリ
      たい	助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
      の	助詞,連体化,*,*,*,*,の,ノ,ノ
      卓球	名詞,サ変接続,*,*,*,*,卓球,タッキュウ,タッキュー
      で	助詞,格助詞,一般,*,*,*,で,デ,デ
      俺	名詞,代名詞,一般,*,*,*,俺,オレ,オレ
      は	助詞,係助詞,*,*,*,*,は,ハ,ワ
      、	記号,読点,*,*,*,*,、,、,、
      そん	名詞,一般,*,*,*,*,そん,ソン,ソン
      だけ	助詞,副助詞,*,*,*,*,だけ,ダケ,ダケ
      !	記号,一般,*,*,*,*,!,!,!
      EOS
- Parse the given text into an enumeration of nodes. When you provide a block to parse, it yields a MeCab node for each morpheme, and each node carries much more detailed information:

      nm.parse('飛べねえ鳥もいるってこった。') do |n|
        puts "#{n.surface}\t#{n.wcost}" if n.is_nor?
      end
      飛べ	7175
      ねえ	6661
      鳥	4905
      も	4669
      いる	9109
      って	6984
      こっ	9587
      た	5500
      。	215
- Combine node-parsing with a custom node-format for more interesting processing:

      # -F    ... short-form of --node-format
      # %m    ... morpheme
      # %h    ... part-of-speech ID (IPADIC)
      # %f[0] ... part-of-speech (first ChaSen feature element)
      nm = Natto::MeCab.new('-F%m\t%h\t%f[0]')

      # only output feature attribute of normal nodes,
      # ignoring end-of-sentence or unknown nodes
      nm.parse('あんたはオイラに飛び方を教えてくれた。') do |n|
        puts n.feature if n.is_nor?
      end
      あんた	59	名詞
      は	16	助詞
      オイラ	59	名詞
      に	13	助詞
      飛び	31	動詞
      方	57	名詞
      を	13	助詞
      教え	31	動詞
      て	18	助詞
      くれ	33	動詞
      た	25	助動詞
      。	7	記号
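Once the feature strings follow a custom node-format, they are easy to post-process in plain Ruby. The sketch below is illustrative only: `parse_feature_line` and `tally_pos` are hypothetical helpers, not part of natto, and they assume tab-separated fields in the `%m`, `%h`, `%f[0]` order used above. The sample strings are the first five lines of the output above, so the sketch runs without MeCab installed.

```ruby
# Hypothetical helpers; not part of natto.

# Splits one line of '-F%m\t%h\t%f[0]' output into named fields.
def parse_feature_line(line)
  surface, pos_id, pos = line.chomp.split("\t", 3)
  { surface: surface, pos_id: Integer(pos_id), pos: pos }
end

# Counts how often each part of speech occurs in the given lines.
def tally_pos(lines)
  lines.each_with_object(Hash.new(0)) do |line, counts|
    counts[parse_feature_line(line)[:pos]] += 1
  end
end

# Sample feature strings copied from the output above. With natto installed,
# they could instead be gathered inside the parse block (features << n.feature).
features = [
  "あんた\t59\t名詞",
  "は\t16\t助詞",
  "オイラ\t59\t名詞",
  "に\t13\t助詞",
  "飛び\t31\t動詞"
]

tally_pos(features).each { |pos, count| puts "#{pos}\t#{count}" }
```

Copying the fields into ordinary hashes keeps later processing independent of the parser object.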