A Keras implementation of the deep learning network introduced by Buoy et al. in the paper Joint Khmer Word Segmentation and Part-of-Speech Tagging Using Deep Learning. The network performs word segmentation and part-of-speech (POS) tagging simultaneously.
tensorflow==2.7.0
{
  "training": {
    "batch_size": 128, // The batch size during training.
    "learning_rate": 0.001 // The learning rate.
  },
  "model": {
    "num_stacks": 2, // The number of stacked LSTM layers.
    "hidden_layers_dim": 100, // The number of units in each hidden LSTM layer.
    "max_sentence_length": 687 // The maximum number of characters in a sentence.
  }
}
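The `//` comments in the listing above are not valid in standard JSON, so Python's `json` module rejects them. If your config file keeps such annotations, the following is a minimal sketch of one way to load it. The helper name `load_config` is hypothetical and not part of this repo, and the sketch assumes no string value in the config contains `//`.

```python
import json
import re


def load_config(text: str) -> dict:
    """Parse JSON text after dropping `//` end-of-line comments.

    Assumption: no string value contains "//" (true for the config shown above).
    """
    stripped = re.sub(r"//[^\n]*", "", text)
    return json.loads(stripped)


# The config from this README, comments included.
sample = """
{
  "training": {
    "batch_size": 128, // The batch size during training.
    "learning_rate": 0.001 // The learning rate.
  },
  "model": {
    "num_stacks": 2, // The number of stacked LSTM layers.
    "hidden_layers_dim": 100, // The number of units in each hidden LSTM layer.
    "max_sentence_length": 687 // The maximum number of characters in a sentence.
  }
}
"""

config = load_config(sample)
print(config["training"]["batch_size"], config["model"]["max_sentence_length"])
```

If the file on disk is plain JSON without comments, `json.load` on the open file is sufficient.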
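This repo expects datasets as text files in the format below, one sample per line. The sentence and sentence_tag are separated by a tab (\t) character. In the samples, sentence_tag carries one /-prefixed tag per character of the sentence.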
sentence sentence_tag
Sample:
ផលិត^កម្ម /NN/NS/NS/NS/NS/NS/NS/NS/NS
នេះគឺ_ជាទេព្យផល្គុន /DT/NS/NS/VB/NS/NS/NS/NS/PN/NS/NS/NS/NS/PN/NS/NS/NS/NS/NS
...
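The format above can be read with a few lines of Python. This is a sketch only: the function names `parse_line` and `to_words` are hypothetical and not part of this repo. It also assumes, as the samples suggest, that the first character of each word carries the word's POS tag and that `NS` marks a character that continues the current word.

```python
from typing import List, Tuple


def parse_line(line: str) -> Tuple[str, List[str]]:
    """Split one dataset line into the sentence and its per-character tags."""
    sentence, tag_field = line.rstrip("\n").split("\t")
    # "/NN/NS/NS" -> ["NN", "NS", "NS"]; the leading "/" yields an empty item.
    tags = [tag for tag in tag_field.split("/") if tag]
    if len(tags) != len(sentence):
        raise ValueError(
            f"expected {len(sentence)} tags, found {len(tags)}"
        )
    return sentence, tags


def to_words(sentence: str, tags: List[str]) -> List[Tuple[str, str]]:
    """Group characters into (word, pos_tag) pairs.

    Assumption: any tag other than "NS" starts a new word.
    """
    words: List[List[str]] = []
    for char, tag in zip(sentence, tags):
        if tag != "NS" or not words:
            words.append([char, tag])
        else:
            words[-1][0] += char
    return [(word, tag) for word, tag in words]


# First sample from this README, with an explicit tab separator.
line = "ផលិត^កម្ម\t/NN/NS/NS/NS/NS/NS/NS/NS/NS"
sentence, tags = parse_line(line)
print(to_words(sentence, tags))
```

Python strings are indexed by code point, so each Khmer consonant, vowel sign, and coeng counts as one character, which matches the tag counts in the samples.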
python train.py config train_set char_map pos_map --shuffle=False --epochs=300 --output_dir=output
positional arguments:
config path to config file.
train_set path to training dataset.
char_map path to characters map file.
pos_map path to pos map file.
optional arguments:
-h, --help show this help message and exit.
--shuffle [SHUFFLE] whether to shuffle the dataset when creating the batch.
--epochs EPOCHS the number of epochs to train.
--output_dir OUTPUT_DIR path to output directory.
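How train.py parses these arguments is not shown here. The sketch below declares a comparable interface with `argparse` to illustrate one point worth knowing when passing `--shuffle=False`: with `type=bool`, the string "False" would be truthy, so an explicit converter is needed. The converter `str2bool`, the default values, and the file names are all assumptions for illustration, not the repo's actual code.

```python
import argparse


def str2bool(value: str) -> bool:
    """Convert a command-line string to a bool (hypothetical helper)."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes", "y"}:
        return True
    if lowered in {"false", "0", "no", "n"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("config", help="path to config file.")
    parser.add_argument("train_set", help="path to training dataset.")
    parser.add_argument("char_map", help="path to characters map file.")
    parser.add_argument("pos_map", help="path to pos map file.")
    # nargs="?" lets a bare "--shuffle" mean True; defaults are placeholders.
    parser.add_argument("--shuffle", nargs="?", const=True, default=True,
                        type=str2bool,
                        help="whether to shuffle the dataset when creating the batch.")
    parser.add_argument("--epochs", type=int, default=1,
                        help="the number of epochs to train.")
    parser.add_argument("--output_dir", default="output",
                        help="path to output directory.")
    return parser


# Placeholder file names; substitute your own paths.
args = build_parser().parse_args([
    "config.json", "train.txt", "char_map.txt", "pos_map.txt",
    "--shuffle=False", "--epochs=300", "--output_dir=output",
])
print(args.shuffle, args.epochs, args.output_dir)
```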
The test dataset uses the same tab-separated format as the training dataset described above.
python evaluate.py config test_set char_map pos_map weights --output_dir=output
positional arguments:
config path to config file.
test_set path to test dataset.
char_map path to characters map file.
pos_map path to pos map file.
weights path to weights file.
optional arguments:
-h, --help show this help message and exit
--output_dir OUTPUT_DIR path to output directory.
Pretrained weights are available here. The network was trained for 12 epochs on a modified version of khPOS's train.all2 dataset. The original data consists of 12,000 sentences. For the pretrained weights, however, the sentences were split into sentence chunks, and the resulting dataset consists of 2,172,051 samples. See utils/prepare_khpos_dataset.py for the data conversion process.
You can convert the pretrained weights into a consolidated Keras format or TFLite using the command below:
python convert.py config weights char_map pos_map --output_type=keras --output_dir=output
positional arguments:
config path to config file.
weights path to the weight file.
char_map path to characters map file.
pos_map path to pos map file.
optional arguments:
-h, --help show this help message and exit.
--output_dir OUTPUT_DIR path to output directory.
--output_type OUTPUT_TYPE the type of the output model. One of type: "keras", "tflite"
| Test Set | POS Tag | Tag Accuracy (%) | POS Tagging Accuracy (%) |
|---|---|---|---|
| khPOS OPEN-TEST | AB | 100.00 | 94.09 |
| | AUX | 96.82 | |
| | CC | 96.67 | |
| | CD | 97.55 | |
| | DT | 97.87 | |
| | IN | 93.75 | |
| | JJ | 80.39 | |
| | VB | 91.44 | |
| | NN | 95.17 | |
| | PN | 93.88 | |
| | PA | 75.68 | |
| | PRO | 98.80 | |
| | QT | 80.00 | |
| | RB | 88.99 | |
| | SYM | 97.81 | |
| khPOS CLOSE-TEST | AB | 100.00 | 99.20 |
| | AUX | 100.00 | |
| | CC | 99.52 | |
| | CD | 100.00 | |
| | DT | 100.00 | |
| | IN | 99.81 | |
| | JJ | 99.15 | |
| | VB | 99.39 | |
| | NN | 99.88 | |
| | PN | 97.18 | |
| | PA | 87.32 | |
| | PRO | 99.74 | |
| | QT | 100.00 | |
| | RB | 99.14 | |
| | SYM | 100.00 | |
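The exact metric definitions used by evaluate.py are not stated here. As a rough sketch, per-tag and overall accuracy figures like those in the table could be computed from aligned gold and predicted tag sequences as follows. The function name `tag_accuracies` is hypothetical, and the sketch assumes per-tag accuracy means the share of items with a given gold tag that were predicted correctly. Whether the repo scores at the word or character level, or includes `NS`, is not something this example settles.

```python
from collections import Counter
from typing import Dict, List, Tuple


def tag_accuracies(gold: List[str], pred: List[str]) -> Tuple[Dict[str, float], float]:
    """Return (per-tag accuracy in %, overall accuracy in %).

    Assumption: gold and pred are aligned, one tag per item.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    total: Counter = Counter()
    correct: Counter = Counter()
    for gold_tag, pred_tag in zip(gold, pred):
        total[gold_tag] += 1
        if gold_tag == pred_tag:
            correct[gold_tag] += 1
    per_tag = {tag: 100.0 * correct[tag] / total[tag] for tag in total}
    overall = 100.0 * sum(correct.values()) / len(gold) if gold else 0.0
    return per_tag, overall


# Toy data for illustration only; not results from this repo.
gold_tags = ["NN", "NN", "VB", "JJ"]
pred_tags = ["NN", "VB", "VB", "JJ"]
per_tag, overall = tag_accuracies(gold_tags, pred_tags)
print(per_tag, overall)
```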
- Buoy, R., Taing, N., & Kor, S. (2021). Joint Khmer Word Segmentation and Part-of-Speech Tagging Using Deep Learning. Retrieved from https://arxiv.org/abs/2103.16801
- Loem, M. (2021, May 4). Joint Khmer Word Segmentation and POS tagging. Medium. Retrieved from https://towardsdatascience.com/joint-khmer-word-segmentation-and-pos-tagging-cad650e78d30
- Ye, K. T., Vichet, C., & Yoshinori, S. (2017). Comparison of Six POS Tagging Methods on 12K Sentences Khmer Language POS Tagged Corpus. First Regional Conference on Optical character recognition and Natural language processing technologies for ASEAN languages (ONA 2017). Retrieved from https://github.com/ye-kyaw-thu/khPOS/blob/master/khpos.pdf