2019 Language and Intelligence Challenge: Information Extraction
- Install required packages by:
pip install -r requirements.txt
- Download data: initialize and update the information-extraction-data git submodule, then unzip the data files:
git submodule init
git submodule update
- sample schema:
{"object_type": "地点", "predicate": "祖籍", "subject_type": "人物"}
- sample data, with postag and text as input and spo_list as output:
{
  "postag": [
    {"word": "一直", "pos": "d"},
    {"word": "陪", "pos": "v"},
    {"word": "我", "pos": "r"},
    {"word": "到", "pos": "p"},
    {"word": "现在", "pos": "t"},
    {"word": "是", "pos": "v"},
    {"word": "歌手", "pos": "n"},
    {"word": "马健涛", "pos": "nr"},
    {"word": "原创", "pos": "v"},
    {"word": "的", "pos": "u"},
    {"word": "歌曲", "pos": "n"}
  ],
  "text": "一直陪我到现在是歌手马健涛原创的歌曲",
  "spo_list": [
    {"predicate": "歌手", "object_type": "人物", "subject_type": "歌曲", "object": "马健涛", "subject": "一直陪我到现在"}
  ]
}
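As a rough illustration of how a record like the sample above might be read, the sketch below parses one JSON record, pulls out the (subject, predicate, object) triples, and builds a multi-hot predicate vector of the kind a multi-label classifier could be trained on. The helper names `parse_record` and `multi_hot`, the two-item `PREDICATES` list, and the one-record-per-line layout are assumptions made for this example; they are not taken from the repository code.

```python
import json

# One record in the format shown above (postag omitted for brevity).
# Assumption: the data files store one JSON object per line.
line = (
    '{"text": "一直陪我到现在是歌手马健涛原创的歌曲", '
    '"spo_list": [{"predicate": "歌手", "object_type": "人物", '
    '"subject_type": "歌曲", "object": "马健涛", "subject": "一直陪我到现在"}]}'
)

# Hypothetical subset of the schema's predicates, in a fixed order.
PREDICATES = ["祖籍", "歌手"]


def parse_record(raw):
    """Return the input text and its (subject, predicate, object) triples."""
    record = json.loads(raw)
    triples = [
        (spo["subject"], spo["predicate"], spo["object"])
        for spo in record.get("spo_list", [])
    ]
    return record["text"], triples


def multi_hot(triples, predicates):
    """Mark with 1 every predicate that occurs in at least one triple."""
    present = {predicate for _, predicate, _ in triples}
    return [1 if p in present else 0 for p in predicates]


text, triples = parse_record(line)
labels = multi_hot(triples, PREDICATES)
```

Here `labels` plays the role of the classification target, while `triples` is the gold output that the full pipeline is expected to reproduce.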
- Train multi-label classification model: predict predicate.
- Train sequence labeling model: input text and predicate, output text labeling.
- Extract SPO from sequence labeling result.
Check report/PRML-final-project-doc-2019.pdf for details.
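The last step of the pipeline above, turning a sequence labeling result into SPO triples, could look roughly like the following. This is a minimal sketch that assumes a character-level BIO scheme with the tags `B-SUB`, `I-SUB`, `B-OBJ`, `I-OBJ` and `O`; the tag names and the helpers `extract_spans` and `extract_spo` are hypothetical, and the tagging scheme actually used in this project is described in the report.

```python
def extract_spans(text, tags, label):
    """Collect every substring tagged B-<label> followed by I-<label>."""
    spans, start = [], None
    # A trailing "O" sentinel closes a span that runs to the end of the text.
    for i, tag in enumerate(list(tags) + ["O"]):
        if tag == f"B-{label}":
            if start is not None:  # a new span begins right after another
                spans.append(text[start:i])
            start = i
        elif tag == f"I-{label}" and start is not None:
            continue  # still inside the current span
        else:
            if start is not None:
                spans.append(text[start:i])
            start = None
    return spans


def extract_spo(text, predicate, tags):
    """Pair every tagged subject with every tagged object for one predicate."""
    subjects = extract_spans(text, tags, "SUB")
    objects = extract_spans(text, tags, "OBJ")
    return [
        {"subject": s, "predicate": predicate, "object": o}
        for s in subjects
        for o in objects
    ]


text = "一直陪我到现在是歌手马健涛原创的歌曲"
# Hypothetical labeling output for the predicate "歌手", one tag per character.
tags = ["B-SUB"] + ["I-SUB"] * 6 + ["O"] * 3 + ["B-OBJ", "I-OBJ", "I-OBJ"] + ["O"] * 5
spo_list = extract_spo(text, "歌手", tags)
```

Pairing all subjects with all objects is the simplest choice and can over-generate when a sentence contains several entities for the same predicate; a stricter matching rule may be preferable in that case.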
- CNN, BiRNN, BiLSTM, BiLSTM with max pooling and RCNN
- BERT
- Encoder: BiLSTM and Transformer
- Decoder: CRF
- Initialize fitlog in the classification folder:
cd classification/
fitlog init
fitlog log logs
- Initialize fitlog in the labeling folder:
cd labeling/
fitlog init
fitlog log logs
Zhongyu Chen