
Yet another Thai Word Segmentation that employs multiple linguistic information with attention mechanisms.


tchayintr/thwcc-attn


Character-based Thai Word Segmentation with Multiple Attentions

thwcc-attn

Yet another Thai word segmenter that uses multiple types of linguistic information (characters, character clusters, subwords, and words) with attention mechanisms.

Architecture

  • Character-based word segmentation
  • BiLSTM-CRF architecture
  • Word and character-cluster/subword attention integrated into character representations
  • BIES tagging scheme
    • B: beginning, I: inside, E: end, and S: single
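To make the BIES scheme concrete, the following minimal sketch converts an already segmented sentence into one tag per character and back. It is an illustration only, not code from this repository: the function names `words_to_bies` and `bies_to_words` are hypothetical, and a "character" is assumed to be a single Unicode code point.

```python
from typing import List


def words_to_bies(words: List[str]) -> List[str]:
    """Assign one BIES tag per character of an already segmented sentence."""
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # skip empty tokens
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, inner characters, last character
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


def bies_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their BIES tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at E or S
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that does not end with E or S
        words.append(current)
    return words


# Hypothetical example sentence (assumption: not taken from the corpus).
words = ["แมว", "กิน", "ปลา"]
tags = words_to_bies(words)
print(tags)
print(bies_to_words("".join(words), tags))
```

Under this scheme, decoding amounts to predicting one tag per character and cutting the text after every E or S tag.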

Segmentation Performance (micro-averaged F1 score)

CWCC-WCON
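For readers unfamiliar with the metric, here is a rough sketch of how a micro-averaged F1 score can be computed for word segmentation. It assumes that scoring is done on word spans (a predicted word counts as correct only if both of its boundaries match the gold segmentation) and that counts are pooled over all sentences before precision and recall are computed. Whether this repository scores in exactly this way is an assumption; the function names are hypothetical.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Map a segmented sentence to the set of (start, end) character offsets."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def micro_f1(gold_sents: List[List[str]], pred_sents: List[List[str]]) -> float:
    """Pool span counts over all sentences, then compute precision, recall, F1."""
    n_correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_sents, pred_sents):
        gold_spans, pred_spans = word_spans(gold), word_spans(pred)
        n_correct += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data (assumption: Latin letters stand in for Thai characters).
gold = [["ab", "c"], ["de"]]
pred = [["ab", "c"], ["d", "e"]]
print(round(micro_f1(gold, pred), 4))
```

In the toy data, 2 of the 4 predicted words match the 3 gold words, so precision is 2/4, recall is 2/3, and F1 is 4/7.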

Datasets (based on BEST2010 corpus)

Requirements

  • python3 >= 3.7.3
  • torch >= 1.6.0+cu101 (original: 1.5.0+cu101)
  • allennlp >= 1.1.0 (original: 1.0.0)
  • numpy >= 1.19.2
  • pathlib >= 1.0.1
  • gensim >= 3.8.3
  • pickle (part of the Python standard library)

Modes

  • train: train and evaluate a model
  • decode: decode an input file (unsegmented text) into segmented words

Data format

  • sl: sentence line

Usage

Modes can be selected by running the following sample scripts.

Training models

  • Character-BiLSTM-CRF (baseline): sample_scripts/sample_seg_ch.sh
    • tagging_unit: single
    • ./sample_scripts/sample_seg_ch.sh
  • Character-Transformer-CRF: sample_scripts/sample_seg_tfm.sh
    • tagging_unit: transformer
    • ./sample_scripts/sample_seg_tfm_ch.sh
  • W-WCON (strong baseline): sample_scripts/sample_seg_w.sh
    • tagging_unit: hybrid
    • word-attention
    • ./sample_scripts/sample_seg_w.sh
  • CCC-WCON (preliminary): sample_scripts/sample_seg_cc.sh
    • tagging_unit: mutant
    • cc-attention
    • ./sample_scripts/sample_seg_cc.sh
  • CWSW-WCON (preliminary): sample_scripts/sample_seg_wsw.sh
    • tagging_unit: sub-combinative
    • word-attention and subword-attention
    • ./sample_scripts/sample_seg_wsw.sh
  • CWCC-WCON (best model): sample_scripts/sample_seg_wcc.sh
    • tagging_unit: combinative
    • word-attention and cc-attention
    • ./sample_scripts/sample_seg_wcc.sh

Logs

  • A log file is saved in the log directory and records:
    • training/evaluating scores
    • hyperparameters

Trained models

  • Trained models are saved in models/main, along with:
    • hyperparameters
    • dictionary
    • checkpoint for each break point

Acknowledgement

  • The implementation is based on a modified version of seikanlp

Citation
