Unsupervised Single Document Abstractive Summarization using Semantic Units

This is the source code for our AACL 2022 paper Unsupervised Single Document Abstractive Summarization using Semantic Units (paper link).

Environment

The code requires the following environment:

Operating system	`Ubuntu 18.04+`
`Python` version	`3.6.9+`
CUDA version	`cuda11.2`
Packages	`sum_dist/requirements.txt`
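
As a quick sanity check before installing, the Python requirement in the table above can be verified with a few lines. This is only a minimal sketch: the helper name `python_version_ok` is hypothetical and not part of this repository, and it checks the interpreter version only, not the operating system or CUDA.

```python
import sys

# Minimum interpreter version taken from the environment table (3.6.9+).
MIN_PYTHON = (3, 6, 9)


def python_version_ok(version_info=None):
    """Return True if the given (or current) interpreter meets MIN_PYTHON.

    `python_version_ok` is a hypothetical helper, not part of this repo.
    """
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor, micro) as a plain tuple.
    return tuple(version_info[:3]) >= MIN_PYTHON


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_version_ok() else "too old"
    print("Python {}: {}".format(current, status))
```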

Installation

Download this repo

git clone git@github.com:IKMLab/UASSU.git
# or
git clone https://github.com/IKMLab/UASSU.git

Install packages

cd UASSU
pip install -r requirements.txt
pip install git+https://github.com/huggingface/datasets
# If using CUDA11:
pip install torch==1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html

pyrouge is also required for evaluation.

# The installation steps are packed into a script.
bash pyrouge_setup.sh

A successful pyrouge installation ends with output like:

------------------------------------------------
Ran 10 tests in 4.482s

OK

Reference of potential problems when installing pyrouge:

spaCy also needs to be set up.

pip install -U pip setuptools wheel
pip install -U spacy

# Install models for corresponding languages
# en (for CNN/DM, XSum, Wiki_en, ArXiv)
python -m spacy download en_core_web_sm
# de (for MLSUM_de)
python -m spacy download de_core_news_sm
# es (for MLSUM_es)
python -m spacy download es_core_news_sm
# ru (for MLSUM_ru)
python -m spacy download ru_core_news_sm
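
To see which spaCy models are still missing for the datasets you plan to use, something like the following sketch may help. It assumes that spaCy models are installed as importable Python packages (which is how `python -m spacy download` installs them) and that the Spanish dataset uses `es_core_news_sm`. The dataset keys mirror the `-dataset` values used in the pre-processing commands; `SPACY_MODELS` and `missing_models` are hypothetical names, not part of this repository.

```python
import importlib.util

# Dataset name -> spaCy model, following the download comments above.
# Assumption: MLSUM_es uses the Spanish model es_core_news_sm.
SPACY_MODELS = {
    "cnndm": "en_core_web_sm",
    "xsum": "en_core_web_sm",
    "wiki_en": "en_core_web_sm",
    "arxiv": "en_core_web_sm",
    "mlsum_de": "de_core_news_sm",
    "mlsum_es": "es_core_news_sm",
    "mlsum_ru": "ru_core_news_sm",
}


def missing_models(datasets):
    """Return the spaCy model packages that are needed but not installed."""
    needed = sorted({SPACY_MODELS[name] for name in datasets})
    # find_spec returns None when a top-level package cannot be found.
    return [model for model in needed if importlib.util.find_spec(model) is None]


if __name__ == "__main__":
    for model in missing_models(SPACY_MODELS):
        print("python -m spacy download " + model)
```

Running the script prints one download command per missing model and prints nothing when all models are present.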

Data pre-processing

Pre-processed data (.pkl) are available at this link. Place the downloaded .pkl files in sum_dist/data/preprocess/

Alternatively, pre-process the data yourself with the following commands:

CNN/DM

python -m sum_dist.preprocess.preprocess \
-dataset cnndm \
-read_config sum_dist/exp_configs/config_preliminary_cnndm.json \
-tokenizer bert \
-save_dir ./sum_dist/data/preprocess \
-num_worker 10

XSum

python -m sum_dist.preprocess.preprocess \
-dataset xsum \
-read_config sum_dist/exp_configs/config_preliminary_xsum.json \
-tokenizer bert \
-save_dir ./sum_dist/data/preprocess \
-num_worker 10

MLSUM_de

python -m sum_dist.preprocess.preprocess \
-dataset mlsum_de \
-read_config sum_dist/exp_configs/config_preliminary_mlsum_de.json \
-tokenizer bert \
-save_dir ./sum_dist/data/preprocess \
-num_worker 10

MLSUM_es

python -m sum_dist.preprocess.preprocess \
-dataset mlsum_es \
-read_config sum_dist/exp_configs/config_preliminary_mlsum_es.json \
-tokenizer bert \
-save_dir ./sum_dist/data/preprocess \
-num_worker 10

MLSUM_ru

python -m sum_dist.preprocess.preprocess \
-dataset mlsum_ru \
-read_config sum_dist/exp_configs/config_preliminary_mlsum_ru.json \
-tokenizer bert \
-save_dir ./sum_dist/data/preprocess \
-num_worker 10

Wiki_en

python -m sum_dist.preprocess.preprocess \
-dataset wiki_en \
-read_config sum_dist/exp_configs/config_preliminary_wiki_en.json \
-tokenizer bert \
-save_dir ./sum_dist/data/preprocess \
-num_worker 10

ArXiv

python -m sum_dist.preprocess.preprocess \
-dataset arxiv \
-read_config sum_dist/exp_configs/config_preliminary_arxiv.json \
-tokenizer bert \
-save_dir ./sum_dist/data/preprocess \
-num_worker 5
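
The seven commands above differ only in the dataset name, the matching config file, and the worker count, so they can be generated from one template. The sketch below builds the same argument lists and, by default, only prints them. `build_preprocess_command` and `preprocess_all` are hypothetical helpers, not part of this repository; the flags and paths are copied from the commands above.

```python
import subprocess
import sys

# Worker counts as used in the commands above: ArXiv uses 5, the rest 10.
NUM_WORKERS = {
    "cnndm": 10,
    "xsum": 10,
    "mlsum_de": 10,
    "mlsum_es": 10,
    "mlsum_ru": 10,
    "wiki_en": 10,
    "arxiv": 5,
}


def build_preprocess_command(dataset, save_dir="./sum_dist/data/preprocess"):
    """Build the pre-processing command for one dataset as an argument list."""
    if dataset not in NUM_WORKERS:
        raise ValueError("unknown dataset: " + dataset)
    config = "sum_dist/exp_configs/config_preliminary_{}.json".format(dataset)
    return [
        sys.executable, "-m", "sum_dist.preprocess.preprocess",
        "-dataset", dataset,
        "-read_config", config,
        "-tokenizer", "bert",
        "-save_dir", save_dir,
        "-num_worker", str(NUM_WORKERS[dataset]),
    ]


def preprocess_all(dry_run=True):
    """Print every command (dry run) or run them one after another."""
    for dataset in NUM_WORKERS:
        command = build_preprocess_command(dataset)
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops at the first dataset that fails.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    preprocess_all(dry_run=True)
```

Run it from the repository root so that the `sum_dist` module and the relative config paths resolve; pass `dry_run=False` to execute the commands.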

Model training

Download link for trained checkpoints

bash scripts/train_cnndm_w5.sh

Inference

bash scripts/infer_cnndm_w5.sh

Evaluation

bash scripts/evaluate_cnndm_w5.sh

Datasets & Required Summary Length

Use the values below to set truncate_len during evaluation.

Dataset	Summary Length
CNN/DM	50
XSum	50
MLSUM_de	30
MLSUM_es	20
MLSUM_ru	15
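
To illustrate what these limits mean in practice, here is a minimal sketch that cuts a generated summary to the length listed for its dataset. It assumes the length is counted in whitespace-separated tokens, which may differ from how the evaluation script actually applies truncate_len. `SUMMARY_LENGTH` and `truncate_summary` are hypothetical names, not part of this repository.

```python
# Summary lengths from the table above, keyed by the -dataset names.
SUMMARY_LENGTH = {
    "cnndm": 50,
    "xsum": 50,
    "mlsum_de": 30,
    "mlsum_es": 20,
    "mlsum_ru": 15,
}


def truncate_summary(summary, dataset):
    """Keep at most SUMMARY_LENGTH[dataset] whitespace-separated tokens.

    Assumption: length is measured in whitespace tokens; the repository's
    evaluation code may count differently.
    """
    limit = SUMMARY_LENGTH[dataset]
    tokens = summary.split()
    return " ".join(tokens[:limit])


if __name__ == "__main__":
    long_summary = " ".join("w{}".format(i) for i in range(100))
    print(len(truncate_summary(long_summary, "mlsum_ru").split()))
```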
