Querela/thai-segmenter

Overview

This package provides utilities for Thai sentence segmentation, word tokenization and POS tagging. Because of how sentence segmentation is performed, prior tokenization and POS tagging are required and are therefore also provided by this package.

Besides functions for sentence segmentation, tokenization, and tokenization with POS tagging of single sentence strings, there are also functions for processing large amounts of data in a streaming fashion. These are also accessible through the command-line script thai-segmenter, which reads from files or standard input and writes to files or standard output. Options allow working with meta-headers or tab-separated data files.

The main functionality for sentence segmentation was extracted, reformatted and slightly rewritten from another project, Question Generation Thai.

LongLexTo is used as a state-of-the-art word/lexeme tokenizer. An implementation was packaged in the above project, but there are also (original?) versions on GitHub and on the project homepage. To make it easier to use for bulk processing in Python, it has been rewritten from Java in pure Python.
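LongLexTo is a dictionary-based tokenizer built around longest matching. The sketch below illustrates only the basic greedy idea: at each position, take the longest dictionary word that fits. The function name `longest_match_tokenize` and the toy dictionary are assumptions made for illustration; they are not the package's API or word list, and the real tokenizer treats ambiguity and unknown words more carefully.

```python
def longest_match_tokenize(text, dictionary):
    """Greedy longest-matching tokenizer (illustrative sketch only)."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink the window.
        for end in range(min(len(text), pos + max_len), pos, -1):
            if text[pos:end] in dictionary:
                tokens.append(text[pos:end])
                pos = end
                break
        else:
            # No dictionary word starts here: emit one character as unknown.
            tokens.append(text[pos])
            pos += 1
    return tokens


# Toy dictionary (hypothetical, not the word list shipped with the package).
I, LOVE, LANGUAGE, THAI = "ฉัน", "รัก", "ภาษา", "ไทย"
toy_dictionary = {I, LOVE, LANGUAGE, THAI, LANGUAGE + THAI}

# The compound LANGUAGE + THAI wins over its two shorter parts.
print(longest_match_tokenize(I + LOVE + LANGUAGE + THAI, toy_dictionary))
```

Trying the longest window first is what makes the compound word win; a purely greedy strategy can still pick a wrong split when an early long match blocks a better later one.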

For POS tagging, a Viterbi model with the annotated Orchid corpus is used (see the paper).
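To make the Viterbi idea concrete, here is a minimal sketch of Viterbi decoding over a toy hidden Markov model. Everything in it is an assumption for illustration: the two tags `NOUN` and `VERB` are placeholders rather than Orchid tags, the probabilities are hand-picked rather than estimated from a corpus, and the function is not the package's implementation.

```python
def viterbi(tokens, tags, start_p, trans_p, emit_p, unknown_p=1e-6):
    """Return the most probable tag sequence for tokens under a toy HMM."""
    if not tokens:
        return []
    # best[i][tag] = (probability of the best path ending in tag, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(tokens[0], unknown_p), None) for t in tags}]
    for token in tokens[1:]:
        column = {}
        for t in tags:
            # Pick the predecessor tag that maximizes the path probability.
            prob, prev = max((best[-1][p][0] * trans_p[p][t], p) for p in tags)
            column[t] = (prob * emit_p[t].get(token, unknown_p), prev)
        best.append(column)
    # Backtrack from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


# Hand-picked toy model (hypothetical values, not derived from Orchid).
DOG, BARK = "หมา", "เห่า"
TAGS = ["NOUN", "VERB"]
START_P = {"NOUN": 0.7, "VERB": 0.3}
TRANS_P = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.6, "VERB": 0.4},
}
EMIT_P = {
    "NOUN": {DOG: 0.6, BARK: 0.1},
    "VERB": {DOG: 0.05, BARK: 0.7},
}

print(viterbi([DOG, BARK], TAGS, START_P, TRANS_P, EMIT_P))
```

A production tagger would typically work with log probabilities to avoid underflow on long sentences; plain products are used here only to keep the sketch short.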

  • Free software: MIT license

Installation

pip install thai-segmenter

Documentation

To use the project:

sentence = """foo bar 1234"""

# [A] Sentence Segmentation
from thai_segmenter.tasks import sentence_segment
# or even easier:
from thai_segmenter import sentence_segment
sentences = sentence_segment(sentence)

# use a separate loop variable so `sentence` keeps the original input
for sent in sentences:
    print(str(sent))

# [B] Lexeme Tokenization
from thai_segmenter import tokenize
tokens = tokenize(sentence)
for token in tokens:
    print(token, end=" ", flush=True)

# [C] POS Tagging
from thai_segmenter import tokenize_and_postag
sentence_info = tokenize_and_postag(sentence)
for token, pos in sentence_info.pos:
    print("{}|{}".format(token, pos), end=" ", flush=True)

See more possibilities in tasks.py or cli.py.
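The `token|pos` text printed in example [C] can be read back into pairs for further processing. This is a minimal sketch, assuming items are separated by whitespace and tokens contain no whitespace themselves; `parse_tagged_line` is a hypothetical helper, not part of the package, and the tag names in the sample are placeholders.

```python
def parse_tagged_line(line):
    """Parse 'token|pos token|pos ...' into a list of (token, pos) pairs."""
    pairs = []
    for item in line.split():
        # Split on the last "|" so tokens that contain "|" stay intact.
        token, sep, pos = item.rpartition("|")
        if sep:
            pairs.append((token, pos))
        else:
            # No tag attached to this item.
            pairs.append((item, None))
    return pairs


print(parse_tagged_line("foo|NOUN bar|VERB "))
```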

Streaming larger sequences can be achieved like this:

# Streaming
sentences = ["sent1\n", "sent2\n", "sent3\n"]  # or any iterable (like File)
from thai_segmenter import line_sentence_segmenter
sentences_segmented = line_sentence_segmenter(sentences)
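Because the input can be any iterable of lines, the same pattern suits files that are too large to hold in memory. The sketch below shows the lazy, line-by-line idea with a plain generator. `stream_process` and the stand-in `split_on_period` are hypothetical names that only mimic the shape of such a pipeline; they do not reproduce what `line_sentence_segmenter` does or returns.

```python
import io


def stream_process(lines, process):
    """Lazily apply `process` to each non-blank line of an iterable."""
    for line in lines:
        line = line.strip()
        if line:
            yield process(line)


def split_on_period(line):
    # Stand-in for a real segmenter; splits on "." purely for demonstration.
    return [part.strip() for part in line.split(".") if part.strip()]


# An in-memory file object stands in for a large file opened with open().
fake_file = io.StringIO("one. two.\n\nthree.\n")
for sentences in stream_process(fake_file, split_on_period):
    print(sentences)
```

Only one line is held at a time, so memory use stays flat regardless of the input size.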

Command-line tool

This project also provides a command-line tool, thai-segmenter, that does most of the work for you:

usage: thai-segmenter [-h] {clean,sentseg,tokenize,tokpos} ...

Thai Segmentation utilities.

optional arguments:
  -h, --help            show this help message and exit

Tasks:
  {clean,sentseg,tokenize,tokpos}
    clean               Clean input from non-thai and blank lines.
    sentseg             Sentence segmentize input lines.
    tokenize            Tokenize input lines.
    tokpos              Tokenize and POS-tag input lines.

You can run sentence segmentation like this:

thai-segmenter sentseg -i input.txt -o output.txt

or even pipe data:

cat input.txt | thai-segmenter sentseg > output.txt

Use -h/--help to get more information about possible control flow options.
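The file-based invocation can also be driven from Python with the standard library. This is a minimal sketch that uses only the `sentseg` task and the `-i`/`-o` options shown above; the helper names `build_sentseg_command` and `run_sentseg` are hypothetical, and the command is executed only if the thai-segmenter executable is found on the PATH.

```python
import shutil
import subprocess


def build_sentseg_command(input_path, output_path):
    """Build the argument list for a file-to-file sentence segmentation run."""
    return ["thai-segmenter", "sentseg", "-i", input_path, "-o", output_path]


def run_sentseg(input_path, output_path):
    """Run the tool if it is installed; return False when it is not found."""
    command = build_sentseg_command(input_path, output_path)
    if shutil.which(command[0]) is None:
        return False
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(command, check=True)
    return True


print(build_sentseg_command("input.txt", "output.txt"))
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.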

You can run it somewhat interactively with:

thai-segmenter tokpos --stats

In this mode, standard input and output are used. Each line is processed and printed immediately after you press Enter. Stop with the key combination Ctrl+D; with the --stats parameter, some statistics are then printed.

WebApp

The project also provides a demo WebApp (using Flask and gevent) that can be installed with:

pip install -e .[webapp]

and then simply run (in the foreground):

thai-segmenter-webapp

Consider running it in a screen session.

# create the screen detached and then attach
screen -dmS thai-senseg-webapp
screen -r thai-senseg-webapp

# in the screen:
thai-segmenter-webapp

# and detach with keys [Ctrl]+[A], then [D]

Please note that it is only a demo web app to test and visualize how the sentence segmenter works.

Development

To install the package for development:

git clone https://github.com/Querela/thai-segmenter.git
cd thai-segmenter/
pip install -e .[dev]

After changing the source, run auto code formatting with:

isort <file>.py
black <file>.py

And check it afterwards with:

flake8 <file>.py

The setup.py also contains the flake8 subcommand as well as an extended clean command.

Tests

To run all the tests, run:

tox

You can also optionally run pytest alone:

pytest

Or with:

python setup.py test

Note, to combine the coverage data from all the tox environments, run:

On Windows:

set PYTEST_ADDOPTS=--cov-append
tox

On other platforms:

PYTEST_ADDOPTS=--cov-append tox
