Chinese-Punctuation-Restoration-with-Bert-CNN-RNN


This repository is built on a backbone from the BertPunc repo. On top of it, we implemented our original idea of a word-level BERT-CNN-RNN model for Chinese punctuation restoration.


Requirements

  • Required packages:
torch==1.1.0
numpy==1.19.4
scikit_learn==0.23.2
tqdm==4.54.1
transformers==4.0.1
  • Install the packages for this project:
pip install -r requirements.txt

1. Difference from My Previous Repo

The previous work used a simple BiLSTM network for punctuation restoration. We then tried integrating a CNN and attention with the BiLSTM, but neither brought any improvement for Chinese punctuation. A seq-to-seq approach also performed poorly on the Chinese punctuation restoration task.

In this work, we bring in BERT. Since BERT has already been widely used in many works, we add the insight of a word-level concept to make our contribution more meaningful.

BERT and its variants rely on a character-level tokenizer for Chinese. Unlike an English tokenizer, which mostly preserves word-level semantics, the Chinese tokenizer simply splits a sentence into characters, which do not always carry a complete meaning on their own. This greatly limits the model's capability. As you can easily imagine, when a pretrained model is fine-tuned on a downstream task, a character tokenizer makes it concentrate on character-level information, and some word-level relations may even be forgotten.
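
For illustration, here is a minimal sketch of the difference (the model name bert-base-chinese and the jieba segmenter are assumptions used only for comparison; they are not part of this repo):

from transformers import BertTokenizer
import jieba  # third-party word segmenter, used here only for comparison

sentence = "我们研究中文标点恢复"

# BERT's Chinese tokenizer produces one token per character
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
print(bert_tokenizer.tokenize(sentence))
# ['我', '们', '研', '究', '中', '文', '标', '点', '恢', '复']

# A word segmenter keeps multi-character words together
print(list(jieba.cut(sentence)))
# e.g. ['我们', '研究', '中文', '标点', '恢复']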

2. Method Details

Our model uses two types of features for the final punctuation predictions (a sketch of how they are combined follows the figures below):

  1. Word-level features: well-designed CNN layers, as shown in Figure 1.
  2. Character-level features: BERT outputs, as shown in Figure 2.
Figure 1. Word-level features
Figure 2. Character-level features
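
The sketch below shows one way these two feature streams can be fused (a minimal sketch only; the class name, layer sizes, and number of punctuation labels are illustrative assumptions, not the repository's actual BertChineseEmbSlimCNNlstmBert definition):

import torch
import torch.nn as nn
from transformers import BertModel

class BertCnnLstmPunct(nn.Module):
    """Sketch: fuse character-level BERT features with word-level CNN
    features and classify each position's punctuation with a BiLSTM."""

    def __init__(self, num_punct=4, hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        dim = self.bert.config.hidden_size
        # word-level features: a 1-D convolution over neighboring characters
        self.cnn = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        # BiLSTM fuses the two feature streams along the sequence
        self.lstm = nn.LSTM(dim * 2, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(hidden * 2, num_punct)

    def forward(self, input_ids, attention_mask):
        # character-level features from BERT: (batch, seq_len, dim)
        char_feats = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        # word-level features from the CNN over the same sequence
        word_feats = self.cnn(char_feats.transpose(1, 2)).transpose(1, 2)
        fused, _ = self.lstm(torch.cat([char_feats, word_feats], dim=-1))
        return self.classifier(fused)  # per-character punctuation logits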

3. Code

  • train_1_to_1.py: training code. (train.py is the original training process from BertPunc; it uses the BERT outputs of 32 neighboring characters to predict one character's punctuation. See the windowing sketch after this list.)
  • data_1_to_1.py: helper functions to read and transform data. (data.py is the original data helper.)
  • model_1_to_1.py: contains many of our original models; the best model is BertChineseEmbSlimCNNlstmBert. (model_1_to_1_seg.py integrates a BERT model fine-tuned on a segmentation task with our model.)
  • evaluate_*.ipynb: evaluation Jupyter notebooks on the IWSLT Chinese test sets.
  • ./data: some training and test data.
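
As a rough illustration of the 32-neighbor-character windowing mentioned for train.py (a sketch under assumptions; the function name and padding details are hypothetical and not taken from the repo's code):

def make_windows(token_ids, window=32, pad_id=0):
    """For each character position, collect the ids of the surrounding
    window of characters, so the model predicts that character's
    punctuation from its neighbors (BertPunc-style input sketch)."""
    half = window // 2
    padded = [pad_id] * half + list(token_ids) + [pad_id] * half
    return [padded[i:i + window] for i in range(len(token_ids))]

# each row is the character context used to predict one position's punctuation
windows = make_windows([101, 2769, 812, 102], window=8)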

4. Experiment Results

We conducted experiments on the Chinese TED talk transcripts in IWSLT2012. We also experimented with other datasets, such as People's Daily and books. The results vary: well-formed, grammatical text achieves better results, while speech transcripts score lower. There is more work to do.

  • Test results for the best model, BertChineseEmbSlimCNNlstmBert, are in evaluate_iwslt-BertChineseEmbSlimCNNlstmBert.ipynb, as shown below:
Figure 3. Experiment results
