This is a simple wrapper for Japanese tokenizers (a.k.a. morphological analyzers).
This repository aims to let you call a tokenizer and split a sentence into tokens in one line.
- put in a sentence and get back a set of tokens (see the sketch below)
- filter tokens by your part-of-speech conditions or stopwords
- add an extension dictionary such as the mecab-neologd dictionary
- define your own user dictionary, which forces MeCab to treat its entries as single tokens (see the sketch at the end of this README)
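A minimal sketch of the first two points. The class name `MecabWrapper` and the methods `tokenize()`, `filter()`, and `convert_list_object()` are assumptions about the interface, not confirmed names; see examples/ for the actual API.

```python
# -*- coding: utf-8 -*-
# Hypothetical sketch; real class/method names may differ (see examples/).
from JapaneseTokenizer import MecabWrapper  # assumed entry point

sentence = u'テヘラン市内で爆発があった'

# Tokenize in one line and get a list of surface forms.
tokens = MecabWrapper(dictType='ipadic').tokenize(sentence).convert_list_object()
print(tokens)

# Keep only nouns with a part-of-speech condition.
nouns = MecabWrapper(dictType='ipadic').tokenize(sentence)\
    .filter(pos_condition=[(u'名詞',)]).convert_list_object()
print(nouns)
```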
This package works under both Python 2.x and Python 3.x.
However, it has mainly been tested under Python 2.x, so I would be glad if you could report any bugs you find under Python 3.x.
See here to install the MeCab system.
The mecab-neologd dictionary is a dictionary extension based on the ipadic dictionary, which is the default dictionary of MeCab.
With the mecab-neologd dictionary, you can turn newly coined words into single tokens.
Here, newly coined words are things such as movie actor names or company names.
See [here](https://github.com/neologd/mecab-ipadic-neologd) to install the mecab-neologd dictionary.
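As an illustration, once mecab-neologd is installed you can compare it against the default dictionary with the standard MeCab Python binding. This is a sketch; the dictionary path below is an assumption, since the install location varies by system (check yours with `mecab-config --dicdir`).

```python
# -*- coding: utf-8 -*-
import MeCab

sentence = '初音ミクは歌手です'

# Default system dictionary (ipadic): a coined name like 初音ミク is
# usually split into several tokens.
print(MeCab.Tagger().parse(sentence))

# mecab-neologd: the same name should come out as a single token.
# NOTE: the path below is an assumption; find yours with `mecab-config --dicdir`.
print(MeCab.Tagger('-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd').parse(sentence))
```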
Execute the following command:

```
python install_python_dependencies.py
```
This command automatically installs all libraries that this package depends on.
Then install the package itself:

```
[sudo] python setup.py install
```
See examples/ for usage.
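For the user-dictionary feature listed at the top, the sketch below shows one common way to build a MeCab user dictionary and load it. The tool path, the system dictionary path, the context IDs (1285), and the cost (5000) are all assumptions that may need adjusting for your system.

```python
# -*- coding: utf-8 -*-
import io
import subprocess
import MeCab

# One entry in ipadic CSV format: surface, left-id, right-id, cost, then
# nine feature columns. The context IDs (1285) and cost (5000) are assumed
# values for a generic noun and may need tuning.
entry = u'ニンテンドースイッチ,1285,1285,5000,名詞,固有名詞,一般,*,*,*,ニンテンドースイッチ,*,*\n'
with io.open('user.csv', 'w', encoding='utf-8') as f:
    f.write(entry)

# Compile the CSV into a binary user dictionary with mecab-dict-index.
# NOTE: both paths are assumptions; check `mecab-config --libexecdir`
# and `mecab-config --dicdir` on your system.
subprocess.check_call([
    '/usr/local/libexec/mecab/mecab-dict-index',
    '-d', '/usr/local/lib/mecab/dic/ipadic',  # system dictionary the IDs refer to
    '-u', 'user.dic',                         # output user dictionary
    '-f', 'utf-8', '-t', 'utf-8',             # input / output charsets
    'user.csv',
])

# Load the user dictionary; the registered word now comes out as one token.
print(MeCab.Tagger('-u user.dic').parse('ニンテンドースイッチが発売された'))
```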