Data Pre-processing

Run getdata.sh in this directory (data_preprocessing) to obtain and pre-process the data. The script downloads and processes the official UD (Universal Dependencies) data. CTB5, CTB6, CTB7, and CTB9 are not downloaded automatically: you need to obtain the official data yourself and put each raw data folder under the data_preprocessing directory. The folder names for the CTB datasets should be:

CTB5: LDC05T01
CTB6: LDC07T36
CTB7: LDC10T07
CTB9: LDC2016T13
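
Before running getdata.sh, it can help to confirm that the CTB folders are in place. The following is a minimal sketch, not part of the repository: the function name `missing_ctb_folders` is hypothetical, and it assumes it is run from (or pointed at) the data_preprocessing directory. The folder names are the ones listed above.

```python
from pathlib import Path

# Folder names listed in this README, keyed by dataset.
CTB_FOLDERS = {
    "CTB5": "LDC05T01",
    "CTB6": "LDC07T36",
    "CTB7": "LDC10T07",
    "CTB9": "LDC2016T13",
}


def missing_ctb_folders(base_dir="."):
    """Return the datasets whose raw data folder is absent under base_dir."""
    base = Path(base_dir)
    return {
        dataset: folder
        for dataset, folder in CTB_FOLDERS.items()
        if not (base / folder).is_dir()
    }


if __name__ == "__main__":
    # Assumption: the current working directory is data_preprocessing.
    for dataset, folder in missing_ctb_folders().items():
        print(f"{dataset}: expected folder '{folder}' not found")
```

An empty result means all four folders were found; any dataset you do not plan to use can simply be ignored in the report.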

This script also downloads the Stanford CoreNLP Toolkit v3.9.2 (SCT) and the Berkeley Neural Parser (BNP) from their official websites; both are used to obtain the auto-analyzed syntactic knowledge. If you only want to use the knowledge from SCT, you can comment out the part of getdata.sh that downloads BNP. If you want to use the auto-analyzed knowledge from BNP, you need to download both SCT and BNP, because BNP relies on the segmentation results from SCT.

To run SCT, you need Java 8; to run BNP, you need tensorflow==1.1.3.

You can refer to their websites for more information.
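
The two requirements above can be checked from Python before starting a long pre-processing run. This is a rough sketch under stated assumptions, not a script shipped with the repository: all function names are hypothetical, it assumes `java` is on the PATH, and it relies on Java 8 reporting its version in the legacy `1.8.x` form. It only reports what is installed; comparing the TensorFlow version against the requirement is left to the reader.

```python
import re
import subprocess


def parse_java_version(version_output):
    """Extract the quoted version string from `java -version` output."""
    match = re.search(r'version "([^"]+)"', version_output)
    return match.group(1) if match else None


def is_java8(version):
    """Java 8 identifies itself as 1.8.x (the legacy numbering scheme)."""
    return version is not None and version.startswith("1.8.")


def java_version_output():
    """Run `java -version`; return an empty string if java cannot be started."""
    try:
        proc = subprocess.run(
            ["java", "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except OSError:
        return ""
    # `java -version` writes its report to stderr, not stdout.
    return proc.stderr


def installed_tensorflow_version():
    """Return tensorflow.__version__, or None if TensorFlow is not importable."""
    try:
        import tensorflow
    except ImportError:
        return None
    return tensorflow.__version__


def check_prerequisites():
    """Collect the detected Java and TensorFlow versions in one dictionary."""
    java_version = parse_java_version(java_version_output())
    return {
        "java_version": java_version,
        "is_java8": is_java8(java_version),
        "tensorflow_version": installed_tensorflow_version(),
    }
```

Calling `check_prerequisites()` in the environment you intend to use for getdata.sh returns the detected versions; a `None` entry means the corresponding tool was not found.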

All processed data will appear in the data directory, organized by dataset. Each dataset folder contains files with the same names as those in the sample_data folder.
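
One way to confirm that pre-processing finished is to compare each dataset folder against sample_data. The sketch below is illustrative only: `missing_files` is a hypothetical helper, and both paths are passed in explicitly because this README does not state where the data and sample_data folders sit relative to each other.

```python
from pathlib import Path


def missing_files(data_dir, sample_dir):
    """Map each dataset folder under data_dir to the sample file names it lacks."""
    expected = {p.name for p in Path(sample_dir).iterdir() if p.is_file()}
    report = {}
    for dataset in sorted(Path(data_dir).iterdir()):
        if not dataset.is_dir():
            continue
        present = {p.name for p in dataset.iterdir() if p.is_file()}
        absent = sorted(expected - present)
        if absent:
            # Only datasets with at least one missing file are reported.
            report[dataset.name] = absent
    return report
```

An empty dictionary means every dataset folder contains all the file names found in sample_data; otherwise the result lists, per dataset, which files are still missing.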
