Skip to content
/ LasUIE Public

Universal Information Extraction, codes for the NeurIPS-2022 paper: Unifying Information Extraction with Latent Adaptive Structure-aware Generative Language Model.

License

Notifications You must be signed in to change notification settings

ChocoWu/LasUIE

Repository files navigation

LasUIE: Latent Adaptive Structure-aware LM for Universal Information Extraction

pytorch 1.8.1 pytorch 1.8.1 Build Status

The pytroch implementation of the NIPS-2022 paper Unifying Information Extraction with Latent Adaptive Structure-aware Generative Language Model.


🎉 Visit the project page here LasUIE


Quick Links


1. Methodology

1.1 Modeling Universal Information Extraction (UIE)

UIE has been proposed to unify all information extraction tasks in NLP community, which converts the structure prediction of IE tasks universally into the sequence prediction via generative LMs.

All IE jobs essentially revolves around predicting two key elements: <mention spans> or/and their <semantic relations>. In this project, we thus reduce all the IE tasks into three prototypes: span extraction, pair extraction and hyper-pair extraction:

  • I) Span Extraction, e.g.,

    • named entity recognition (NER)
    • aspect-based sentiment analysis (ABSA)
    • aspect-term extraction (ATE)
  • II) Pair Extraction, e.g.,

    • relation extraction (RE)
    • aspect-opinion pair extraction (AOP)
    • aspect-based sentiment triplet extraction (ASTE)
  • III) Hyper-pair Extraction, e.g.,

    • event extraction (EE)
    • semantic role labeling (SRL)
    • opinion role labeling (ORL)

Under this scheme, mention spans are described with <Span> terms and the corresponding <Span Attribute> labels; semantic relations are straightforwardly denoted with <Relation> labels.

And all the IE structures are cast into a sequential representation: Linearized Hierarchical Expression (LHE). For example,

  • in span extraction:

    • { ( Span1 , Attr1 ) , ... , ( Spani , Attri ) , ... }
  • in span extraction:

    • { ... , ( Spani , Attri [ Relk ] Spanj , Attrj ) , ... }
  • in span extraction:

    • { ... , ( Spani , Attri [ Relk ] Spanj , Attrj [ Relm ] Spann , Attrn , ... ) , ... }

1.2 UIE with Structure-aware Generative Language Model

As cast above, UIE has two key common challenges of IEs:

  • Boundary Identification of each span terms (for UIE-element-II: span extraction).

  • Long-range Dependence between different span terms (for UIE-element-I: relation extraction);

We thus propose addressing the two challenges by modeling both the syntactic dependency structure and constituency structure, where the constituency syntax mostly benefits the first challenge; the dependency structure well aids the second challenge. To implement the above idea, we propose learning a Latent Adaptive Structure-aware Generative Language Model for UIE, aka, LasUIE.

LasUIE has a three-stage learning procedure:

  • Stage-I: unsupervised generic pre-training:

    • generally using an off-the-shelf well-trained generative LM (GLM), e.g., BART, T5.
  • Stage-II: unsupervised structure-aware post-training:

    • a newly introduced procedure in this project, inserted between the pre-training and fine-tuning stages for structure learning.
  • Stage-III: supervised task-oriented structure fine-tuning:

    • a newly introduced procedure in this project, along with the task-specific finetuning.

1.2.1 Unsupervised structure-aware post-training

A Heterogeneous structure inductor (HSI) module is used to unsupervisedly enrich the backbone GLM with sufficient structural knowledge, reinforcing the awareness of linguistic syntax.

1.2.2 Supervised task-oriented structure fine-tuning

Further adjusting (finetune) the syntactic attributes within the GLM with stochastic policy gradient algorithm by directly taking the feedback of end task performance, such that the learned structural features are most coincident with the end task needs.


2. Code Usage

2.1 Requirement Installation

  • Step 1: install base envir

    conda create -n lasuie python=3.8
  • Step 2: install pytorch

     # CUDA 10.2
     conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=10.2 -c pytorch
    
     # CUDA 11.3
     conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
  • Step 3: install other requirements

    pip install -r requirements.txt

2.2 Code Structure

│---------------------------------------------------
├─config                           // configuration fold
│    ├─config.json                 // config for generic finetune
│    └─config_struct_tune.json     // config for structural finetune
│
├─data                             // data fold
│  ├─hyperpair                     // dataset for hyperpair extraction 
│  │  └─orl                        // task name
│  │      └─mpqa                   // dataset name
│  │          ├─labels.json        // template labels for hyperpair extraction 
│  │          ├─dev.json           // template dev set for hyperpair extraction 
│  │          ├─test.json          // template test set for hyperpair extraction 
│  │          └─train.json         // template train set for hyperpair extraction 
│  │  
│  ├─pair                           // dataset for pair extraction 
│  │  └─re  
│  │      └─nyt  
│  │          └─...
│  │  
│  ├─span                          // dataset for span extraction  
│  │  └─ner  
│  │      └─conll03  
│  │           └─...
│  │  
│  └─post-training                 // corpos for post-training of the GLM
│      ├─books-corpus  
│      └─wikipedia-en  
│---------------------------------------------------
├─checkpoint                       // saving model checkpoints
│    └─...
├─logs                             // saving experiment logs
│    └─...
├─test_output                      // saving testing/inference outputs
│    └─...
├─figures                          
├─requirements.txt                 
├─README.md             
├─LICENSE  
│---------------------------------------------------
├─engine                           // core codes here 
│    ├─constants.py
│    ├─cus_argument.py
│    ├─data_utils.py
│    ├─evaluating.py
│    ├─module.py
│    ├─t5_modeling.py
│    └─utils.py
│
├─run_struct_post_train.py          // entry of second phase of structural post-training
├─run_finetune.py                   // entry of thrid phase of generic fine-tuning
├─run_finetune_with_struct_tune.py  // entry of thrid phase of structural fine-tuning
├─run_inference.py                  // entry of fourth phase of inference 
â””---------------------------------------------------

2.3 Running Pipeline

The general pipeline goes as:

Step 1         run_struct_post_train.py  
                          ↓ 
Step 2            run_finetune.py (first train, then eval)
                          ↓
Step 3       run_finetune_with_struct_tune.py
                          ↓          
Step 4             run_inference.py

2.3.1 Structure-aware post-training

2.3.2 Supervised fine-tuning

A. task-oriented fine-tuning

  • Choosing to use ModelType.UIE or ModelType.LASUIE (in engine.constants.py) as the model type. ModelType.LASUIE model is much time-consuming than ModelType.UIE.

  • Configurate correctly all the arguments in run_finetune.py#init_args() and the config.json file.

  • Run starting finetune

    python run_finetune.py

B. structure fine-tuning

  • Choosing ModelType.LASUIE_STRUCT_TUNING (in engine.constants.py) as the backbone model.

  • Configurate config_struct_tune.json

  • Run starting structure-finetune

    python run_finetune_with_struct_tune.py
  • Notes: runing run_finetune_with_struct_tune.py is time-consuming.

    • structural fine-tuning is optional, can use the generic fine-tuning (run_finetune.py) instead.
    • recommended GPU requirement: >2* A100 (80G) GPUs.
  • Notes: making sure B. structural-tuning happens after A. generic fine-tuning, because hard-start structural-tuning leads to non-convergence.

2.3.3 Inference

  • Configurate correctly the argument model_checkpoint with the well-trained model.

  • Run starting inference

    python run_inference.py
  • The outputs of predictions will be converted to the UIE structures, and be saved in test_output fold.

2.4 Dataset & Evaluating

2.4.1 Dataset

  • Prepare your own data in the template format of data/hyperpair, data/pair or data/span.

  • Configurate the config/config.json and config/config_struct_tune.json before runing the scripts.

2.4.1 Evaluating

  • During training (tuning), the monitoring metric is rouge, as it is a text generation process.

    • Only enable F1 metric monitoring when model produce stable predictions.
  • The evaluation is based on exact match of spans and triplets, feel free to customize the evaluation metrics in engine/evaluating.py.


3 MISC

3.1 Citation

If you use this work or code, please kindly cite:

@inproceedings{fei2022lasuie,
  author = {Fei, Hao and Wu, Shengqiong and Li, Jingye and Li, Bobo and Li, Fei and Qin, Libo and Zhang, Meishan and Zhang, Min and Chua, Tat-Seng},
  booktitle = {Advances in Neural Information Processing Systems},
  title = {LasUIE: Unifying Information Extraction with Latent Adaptive Structure-aware Generative Language Model},
  url = {https://proceedings.neurips.cc/paper_files/paper/2022/file/63943ee9fe347f3d95892cf87d9a42e6-Paper-Conference.pdf},
  pages = {15460--15475},
  year = {2022}
}

3.2 Acknowledgement

This code is partially referred from following projects or papers: UIE; Structformer, Huggingface-T5.

3.3 License

The code is released under Apache License 2.0 for Noncommercial use only. Any commercial use should get formal permission first from authors.

3.4 Contact

For any question or issue, please contact @Hao Fei and @Shengqiong Wu.

About

Universal Information Extraction, codes for the NeurIPS-2022 paper: Unifying Information Extraction with Latent Adaptive Structure-aware Generative Language Model.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages