Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble (EMNLP 2020)
CRF as the stacked model and DeepCut as the baseline model
- Paper: Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble
- Blog: How far has word segmentation come for Thai and other languages? (ตัดคำภาษาไทยและภาษาอื่นไปถึงไหนกันแล้ว?)
- New version (ACL 2021): OSKut (Out-of-domain StacKed cut for Word Segmentation)
@inproceedings{limkonchotiwat-etal-2020-domain,
title = "Domain Adaptation of {T}hai Word Segmentation Models using Stacked Ensemble",
author = "Limkonchotiwat, Peerat and
Phatthiyaphaibun, Wannaphong and
Sarwar, Raheem and
Chuangsuwanich, Ekapol and
Nutanong, Sarana",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
year = "2020",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.emnlp-main.315",
}
pip install sefr_cut
- python >= 3.6
- python-crfsuite >= 0.9.7
- pyahocorasick == 1.4.0
- Example files are in the SEFR Example notebook
- Try it on Colab
- ws1000, tnhc, and BEST!!
- ws1000: Model trained on Wisesight-1000 and tested on Wisesight-160
- tnhc: Model trained on TNHC (80:20 train/test split with random seed 42)
- BEST: Model trained on the BEST-2010 corpus (NECTEC)
```python
sefr_cut.load_model(engine='ws1000')
# OR sefr_cut.load_model(engine='tnhc')
# OR sefr_cut.load_model(engine='best')
```
- tl-deepcut-XXXX
- We also provide transfer-learning versions of deepcut: tl-deepcut-ws1000 (fine-tuned on Wisesight) and tl-deepcut-tnhc (fine-tuned on TNHC)
```python
sefr_cut.load_model(engine='tl-deepcut-ws1000')
# OR sefr_cut.load_model(engine='tl-deepcut-tnhc')
```
- deepcut
- We also provide the original deepcut
```python
sefr_cut.load_model(engine='deepcut')
```
You need to read the paper to understand why we have the k-value (the percentage of characters the stacked model is allowed to refine).
- Tokenize with default k-value
```python
sefr_cut.load_model(engine='ws1000')
print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ']))
print(sefr_cut.tokenize(['สวัสดีประเทศไทย']))
print(sefr_cut.tokenize('สวัสดีประเทศไทย'))
```
Output:
```
[['สวัสดี', 'ประเทศ', 'ไทย'], ['ลุง', 'ตู่', 'สู้', 'ๆ']]
[['สวัสดี', 'ประเทศ', 'ไทย']]
[['สวัสดี', 'ประเทศ', 'ไทย']]
```
- Tokenize with various k-values
```python
sefr_cut.load_model(engine='ws1000')
print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ'], k=5))    # refine only 5% of the characters
print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ'], k=100))  # refine 100% of the characters
```
Output:
```
[['สวัสดี', 'ประเทศไทย'], ['ลุงตู่', 'สู้', 'ๆ']]
[['สวัสดี', 'ประเทศ', 'ไทย'], ['ลุง', 'ตู่', 'สู้', 'ๆ']]
```
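If you are not sure which k-value suits your text, a quick sweep makes the trade-off visible. This sketch only uses the tokenize() call documented above; the specific k values chosen here are arbitrary.
```python
import sefr_cut

sefr_cut.load_model(engine='ws1000')

# Sweep a few refinement budgets (k = % of characters the stacked model
# may refine) and compare the resulting segmentations; values are arbitrary.
for k in (1, 5, 25, 100):
    print(k, sefr_cut.tokenize(['สวัสดีประเทศไทย', 'ลุงตู่สู้ๆ'], k=k))
```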
- We also provide character- and word-level evaluation via the evaluation() function
- For example:
```python
answer = 'สวัสดี|ประเทศไทย'
pred = 'สวัสดี|ประเทศ|ไทย'
char_score, word_score = sefr_cut.evaluation(answer, pred)
print(f'Word Score: {word_score} Char Score: {char_score}')
# Word Score: 0.4 Char Score: 0.8

answer = ['สวัสดี|ประเทศไทย']
pred = ['สวัสดี|ประเทศ|ไทย']
char_score, word_score = sefr_cut.evaluation(answer, pred)
print(f'Word Score: {word_score} Char Score: {char_score}')
# Word Score: 0.4 Char Score: 0.8

answer = [['สวัสดี|'],['ประเทศไทย']]
pred = [['สวัสดี|'],['ประเทศ|ไทย']]
char_score, word_score = sefr_cut.evaluation(answer, pred)
print(f'Word Score: {word_score} Char Score: {char_score}')
# Word Score: 0.4 Char Score: 0.8
```
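You can also score the tokenizer itself against a small gold set by joining its output with '|' before calling evaluation(). A minimal sketch, assuming '|' as the delimiter as in the examples above; the two gold sentences are placeholders.
```python
import sefr_cut

sefr_cut.load_model(engine='ws1000')

# Placeholder gold segmentations, '|'-delimited as in the examples above.
gold = ['สวัสดี|ประเทศไทย', 'ลุง|ตู่|สู้|ๆ']

# Strip the delimiters, tokenize, and join each sentence back with '|'.
raw = [g.replace('|', '') for g in gold]
pred = ['|'.join(words) for words in sefr_cut.tokenize(raw)]

char_score, word_score = sefr_cut.evaluation(gold, pred)
print(f'Word Score: {word_score} Char Score: {char_score}')
```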
- You can re-train the model. The examples are in the Notebooks folder. We provide everything for you!!
- You can run notebook file #2; the corpus inside 'Notebooks/corpus/' is Wisesight-1000, and you can also try BEST, TNHC, and LST20! (A minimal CRF-training sketch follows this sub-list.)
- Rename the variable CRF_model_name
- Link: HERE
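Outside the notebook, the core of re-training is an ordinary python-crfsuite training loop over character-level features. This is a minimal, self-contained sketch: the feature function and the binary begin-of-word labels are simplified stand-ins for what notebook file #2 actually uses, and the model filename is hypothetical.
```python
import pycrfsuite

def char_features(text, i):
    # Simplified character-window features; the notebook uses a richer set.
    return {
        'char': text[i],
        'prev': text[i - 1] if i > 0 else 'BOS',
        'next': text[i + 1] if i < len(text) - 1 else 'EOS',
    }

def featurize(text):
    return [char_features(text, i) for i in range(len(text))]

# Toy corpus: label '1' marks the first character of a word, '0' otherwise.
texts = ['สวัสดีประเทศไทย']
labels = [['1', '0', '0', '0', '0', '0',   # สวัสดี
           '1', '0', '0', '0', '0', '0',   # ประเทศ
           '1', '0', '0']]                 # ไทย

trainer = pycrfsuite.Trainer(verbose=False)
for text, y in zip(texts, labels):
    trainer.append(featurize(text), y)
trainer.train('my_model.crfsuite')  # hypothetical output filename

tagger = pycrfsuite.Tagger()
tagger.open('my_model.crfsuite')
print(tagger.tag(featurize('สวัสดีประเทศไทย')))
```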
- Set the variable CRF_model_name to the same name used in file #2
- If you want to know why we use filter-and-refine, you can try uncommenting these 3 lines in the score_() function (a conceptual sketch of the idea appears after this sub-list):
```python
# answer = scoring_function(y_true, cp.deepcopy(y_pred), entropy_index_og)
# f1_hypothesis.append(eval_function(y_true, answer))
# ax.plot(range(start, K_num, step), f1_hypothesis, c="r", marker='o', label='Best case')
```
- Link: HERE
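Conceptually, filter-and-refine keeps the baseline's decisions everywhere except the k% of characters where the baseline is least confident, and lets the stacked CRF re-decide only those positions. The sketch below is our own illustration of that idea, not the repository's code; the function name and all inputs are hypothetical.
```python
import numpy as np

def filter_and_refine(base_labels, base_probs, crf_labels, k):
    """Hypothetical illustration. base_labels/crf_labels: 0/1 word-boundary
    decisions per character; base_probs: the baseline's boundary
    probabilities; k: % of characters the stacked CRF may refine."""
    p = np.clip(np.asarray(base_probs, dtype=float), 1e-9, 1 - 1e-9)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))  # filter: uncertainty
    n_refine = int(len(base_labels) * k / 100)
    refine_idx = np.argsort(entropy)[-n_refine:] if n_refine else []
    refined = np.array(base_labels)
    refined[refine_idx] = np.array(crf_labels)[refine_idx]  # refine step
    return refined
```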
- Just move your model from 'Notebooks/model/' to 'sefr_cut/model/' and call the model in one line.
```python
sefr_cut.load_model(engine='my_model')
```
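End to end, that looks like the following. The filename and '.model' extension are assumptions; match whatever the re-training notebook saved.
```python
import shutil

# Hypothetical filename/extension; match what the re-training notebook saved.
shutil.copy('Notebooks/model/my_model.model', 'sefr_cut/model/my_model.model')

import sefr_cut
sefr_cut.load_model(engine='my_model')
print(sefr_cut.tokenize('สวัสดีประเทศไทย'))
```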
Thanks for code from:
- Deepcut (baseline model): We used some code from Deepcut to perform transfer learning
- @bact (CRF training code): We used some code from https://github.com/bact/nlp-thai