Vaporetto is a fast and lightweight tokenizer based on pointwise prediction. This package is a Python wrapper for Vaporetto.
To install the package from PyPI, run the following command:
$ pip install vaporetto
To build from source instead, you first need to install the Rust compiler by following its installation documentation.
vaporetto uses pyproject.toml, so you also need to upgrade pip to version 19 or later.
$ pip install --upgrade pip
After setting up the environment, you can install vaporetto as follows:
$ pip install git+https://github.com/daac-tools/python-vaporetto
python-vaporetto does not include model files. Before tokenizing, follow the Vaporetto documentation to download a distributed model or train your own.
Check the version number as shown below to use compatible models:
>>> import vaporetto
>>> vaporetto.VAPORETTO_VERSION
'0.6.3'
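One way to act on that version string is to compare it with the Vaporetto version a model was built for. The sketch below is only an illustration: the helper names are hypothetical, and treating equal major and minor components as "compatible" is an assumption made here, not a rule stated in this README. Consult the Vaporetto documentation for the actual compatibility policy.

```python
def parse_version(version):
    """Split a dotted version string such as '0.6.3' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def same_major_minor(library_version, model_version):
    """Return True when both versions share the same major and minor parts.

    ASSUMPTION: equal major.minor meaning "compatible" is a guess used for
    illustration only, not a documented Vaporetto rule.
    """
    return parse_version(library_version)[:2] == parse_version(model_version)[:2]


# In real use, the first argument would be vaporetto.VAPORETTO_VERSION.
print(same_major_minor("0.6.3", "0.6.0"))
print(same_major_minor("0.6.3", "0.5.1"))
```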
Examples:
# Import vaporetto module
>>> import vaporetto
# Load the model file
>>> with open('tests/data/vaporetto.model', 'rb') as fp:
...     model = fp.read()
# Create an instance of the Vaporetto
>>> tokenizer = vaporetto.Vaporetto(model, predict_tags = True)
# Tokenize
>>> tokenizer.tokenize_to_string('まぁ社長は火星猫だ')
'まぁ/名詞/マー 社長/名詞/シャチョー は/助詞/ワ 火星/名詞/カセー 猫/名詞/ネコ だ/助動詞/ダ'
>>> tokens = tokenizer.tokenize('まぁ社長は火星猫だ')
>>> len(tokens)
6
>>> tokens[0].surface()
'まぁ'
>>> tokens[0].tag(0)
'名詞'
>>> tokens[0].tag(1)
'マー'
>>> [token.surface() for token in tokens]
['まぁ', '社長', 'は', '火星', '猫', 'だ']
The distributed models are compressed in Zstandard (zstd) format. To load a compressed model, you must decompress it yourself before passing the bytes to the API.
>>> import vaporetto
>>> import zstandard # zstandard package in PyPI
>>> dctx = zstandard.ZstdDecompressor()
>>> with open('tests/data/vaporetto.model.zst', 'rb') as fp:
...     with dctx.stream_reader(fp) as dict_reader:
...         tokenizer = vaporetto.Vaporetto(dict_reader.read(), predict_tags = True)
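To handle compressed and uncompressed model files with the same code path, the file can be checked for the Zstandard frame magic number before decompressing. This is a sketch under stated assumptions: `read_model_bytes` is a hypothetical helper, not part of vaporetto, and it assumes the file holds either a raw model or a standard zstd frame.

```python
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # first four bytes of a Zstandard frame


def read_model_bytes(path):
    """Return raw model bytes, decompressing the file if it is a zstd frame."""
    with open(path, "rb") as fp:
        is_zstd = fp.read(4) == ZSTD_MAGIC
        fp.seek(0)
        if not is_zstd:
            return fp.read()
        # Imported lazily so uncompressed models work without the package.
        import zstandard  # third-party: the zstandard package on PyPI
        with zstandard.ZstdDecompressor().stream_reader(fp) as reader:
            return reader.read()


# Possible usage with the API shown above (requires vaporetto and a model file):
#   tokenizer = vaporetto.Vaporetto(
#       read_model_bytes('tests/data/vaporetto.model.zst'), predict_tags=True)
```

Sniffing the magic number rather than the file extension means a renamed file is still read correctly.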
You can also use KyTea's models as follows:
>>> with open('path/to/jp-0.4.7-5.mod', 'rb') as fp: # doctest: +SKIP
...     tokenizer = vaporetto.Vaporetto.create_from_kytea_model(fp.read())
Note: Vaporetto does not support tag prediction with KyTea's models.
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
To contribute, see the guidelines.