Output Formatting
MeCab's default ChaSen output format takes the following form:

`surface \t feature`

- `surface` is the morpheme itself
- `feature` is a comma-delimited string of the following elements:
  - part-of-speech
  - sub-class 1
  - sub-class 2
  - sub-class 3
  - inflection
  - conjugation
  - root-form
  - reading
  - pronunciation
For details, please refer to the section とりあえず解析してみる ("Trying a quick analysis") of MeCab: Yet Another Part-of-Speech and Morphological Analyzer.
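As a small illustration of the layout described above, the following sketch splits one default-format line into its surface and named feature fields. The helper name `parse_default_line`, the field labels, and the sample line (IPADIC-style output) are assumptions made for this example and are not part of natto or MeCab.

```ruby
# Labels for the feature elements, in the order listed above.
# These names are chosen for this sketch only.
FEATURE_KEYS = %i[pos sub_class1 sub_class2 sub_class3
                  inflection conjugation root_form reading pronunciation].freeze

# Hypothetical helper: split "surface\tfeature" into a Hash.
# Elements missing from the feature string (common for unknown words)
# come back as nil.
def parse_default_line(line)
  surface, feature = line.chomp.split("\t", 2)
  fields = feature.to_s.split(',')
  { surface: surface }.merge(FEATURE_KEYS.zip(fields).to_h)
end

# Illustrative sample line, assuming an IPADIC-style dictionary.
sample = "太郎\t名詞,固有名詞,人名,名,*,*,太郎,タロウ,タロー"
parsed = parse_default_line(sample)
puts parsed[:surface]
puts parsed[:reading]
```

Splitting on the first tab only keeps the feature string intact even if a later field were to contain a tab, and `zip` tolerates lines with fewer than nine feature elements.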
It is possible to override the output format by customizing the Natto::MeCab
node output format using the following macros.
| Macro | Definition |
|---|---|
| `%s` | node status (`stat`) value: 0 normal, 1 unknown, 2 sentence start, 3 sentence end |
| `%S` | the input sentence |
| `%L` | length of the input sentence (bytes) |
| `%m` | morpheme surface |
| `%M` | morpheme surface including leading whitespace (cf. `%pS`) |
| `%h` | part-of-speech ID |
| `%%` | literal `%` character |
| `%c` | word cost |
| `%H` | comma-delimited list of POS, conjugation, reading, etc. |
| `%t` | character type ID |
| `%P` | marginal probability (only with the `-l2` option) |
| `%pi` | unique node ID |
| `%pS` | leading whitespace of the morpheme, if any; `%pS%m` is the same as `%M` |
| `%ps` | start position |
| `%pe` | end position |
| `%pC` | connection cost from the previous node to this one |
| `%pw` | word cost, same as `%c` |
| `%pc` | connection cost + word cost, accumulated from the sentence start |
| `%pn` | connection cost + word cost for this morpheme only (`%pw` + `%pC`) |
| `%pb` | `*` if the node is on the best path; a space otherwise |
| `%pP` | marginal probability (only with the `-l2` option) |
| `%pA` | alpha, forward log probability (only with the `-l2` option) |
| `%pB` | beta, backward log probability (only with the `-l2` option) |
| `%pl` | length of the morpheme (bytes), same as `strlen(%m)` |
| `%pL` | length of the morpheme including any leading whitespace (bytes), same as `strlen(%M)` |
| `%phl` | left path id |
| `%phr` | right path id |
| `%f[N]` | Nth element of MeCab's default output feature |
| `%f[N1,N2,N3...]` | N1,N2,N3... elements of MeCab's default output feature, tab-separated |
| `%FC[N1,N2,N3...]` | N1,N2,N3... elements of MeCab's default output feature, delimited with the character `C`; any whitespace elements are not output |
| `\0 \a \b \t \n \v \f \r \\` | the usual escape sequences |
| `\s` | `' '` (half-width space) |
You can define custom output formats with the above macros by passing the `--node-format`, `--unk-format`, `--bos-format`, `--eos-format`, or `--eon-format` options when instantiating `Natto::MeCab`.
Example 1: Specifying user-defined formats
# pseudo-code: replace each STR with a format string built from the macros above
# long-format style
nm = Natto::MeCab.new('--node-format=STR --bos-format=STR --eos-format=STR --unk-format=STR')
# short-format style
nm = Natto::MeCab.new('-F STR -B STR -E STR -U STR')
# Ruby hash
nm = Natto::MeCab.new(node_format: 'STR', bos_format: 'STR', eos_format: 'STR', unk_format: 'STR')
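For a concrete version of Example 1, the sketch below prints each morpheme's surface followed by its reading. It assumes an IPADIC-style dictionary in which feature element 7 (counting from 0) is the reading, and it assumes the natto gem and MeCab are installed; the sample sentence and constant names are illustrative only. The format strings are single-quoted so that Ruby leaves `\t` and `\n` untouched and MeCab expands them itself, as listed in the macro table.

```ruby
# Single quotes: the backslash escapes reach MeCab unchanged.
NODE_FORMAT = '%m\t%f[7]\n'  # surface, tab, feature element 7 (assumed: reading)
EOS_FORMAT  = 'EOS\n'        # marker emitted at the end of the sentence

begin
  require 'natto'
  # Hash keys use underscores where the command-line options use hyphens.
  nm = Natto::MeCab.new(node_format: NODE_FORMAT, eos_format: EOS_FORMAT)
  puts nm.parse('太郎はこの本を二郎を見た女性に渡した。')
rescue LoadError, StandardError => e
  # Reached when natto, the MeCab library, or a dictionary is unavailable.
  warn "could not run MeCab: #{e.message}"
end
```

When a literal space is needed inside a format string, the `\s` macro is a safer choice than a raw space, since the options are handed to MeCab as a single argument string.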
It is also possible to define your custom output formats in the `$MECAB_HOME/etc/mecabrc` configuration file.
Example 2: Adding user-defined output formats to $MECAB_HOME/etc/mecabrc
# pseudo-code: replace each STR with a format string and KEY with a name of your choice
node-format-KEY = STR
unk-format-KEY = STR
eos-format-KEY = STR
bos-format-KEY = STR
eon-format-KEY = STR
Use the `--output-format-type` option to select the user-defined output format `KEY`.
Example 3: Specifying the user-defined output format KEY
# long-format style
nm = Natto::MeCab.new('--output-format-type=KEY')
# short-format style
nm = Natto::MeCab.new('-O KEY')
# Ruby hash
nm = Natto::MeCab.new(output_format_type: 'KEY')
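To make Examples 2 and 3 more concrete, the following sketch renders the `mecabrc` entries for a hypothetical key `yomi` and then selects that key from Ruby. The key name, the format strings, and the sample sentence are assumptions for illustration; the parse step only succeeds once the printed lines have actually been added to `$MECAB_HOME/etc/mecabrc` and natto and MeCab are installed.

```ruby
# Hypothetical format key, chosen for this sketch.
KEY = 'yomi'

# Format strings rely on MeCab's own escapes, so they stay single-quoted.
FORMATS = {
  'node-format' => '%m\t%f[7]\n',  # surface and feature element 7 (assumed: reading)
  'unk-format'  => '%m\t?\n',      # unknown words: surface and a placeholder
  'eos-format'  => 'EOS\n'
}.freeze

# Render the lines to add to $MECAB_HOME/etc/mecabrc.
mecabrc_lines = FORMATS.map { |name, fmt| "#{name}-#{KEY} = #{fmt}" }
puts mecabrc_lines

# Once those lines are in mecabrc, select the formats by key.
begin
  require 'natto'
  nm = Natto::MeCab.new(output_format_type: KEY)
  puts nm.parse('太郎はこの本を二郎を見た女性に渡した。')
rescue LoadError, StandardError => e
  # Reached when natto/MeCab is unavailable or KEY is not configured yet.
  warn "skipped parse: #{e.message}"
end
```

Keeping the formats in `mecabrc` under a key means the Ruby code only has to carry the key name, which helps when several scripts share the same output layout.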
Further details may be found in the section 出力フォーマットの指定 ("Specifying the output format") of MeCab: Yet Another Part-of-Speech and Morphological Analyzer.