This repository contains the official code and data for the ACL 2024 Findings paper *Bilingual Rhetorical Structure Parsing with Large Parallel Annotations*.
This repository focuses on data and experiments. For applying the trained parsers, visit the IsaNLP RST repository for models and usage instructions.
The data directory structure should be as follows:
```
data/
├── gum_rs3/
│   ├── en/
│   │   └── *.rs3
│   └── ru/
│       └── *_RU.rs3
├── rstdt_rs3/
│   ├── TEST/
│   │   └── wsj_*.rs3
│   └── TRAINING/
│       └── wsj_*.rs3
└── rurstb_rs3/
    ├── train.*_part_*.rs3
    ├── dev.*_part_*.rs3
    └── test.*_part_*.rs3
```
- `gum_rs3/ru/` — contains the RRG corpus in Russian; provided in `data/RRG.zip`.
- `gum_rs3/en/` — place the GUM RST `*.rs3` files here (GUM dataset link).
- `rstdt_rs3/` — place the RST-DT `*.rs3` files here (RST-DT dataset link).
- `rurstb_rs3/` — contains the RRT corpus (one document = one tree); provided in `data/rurstb_rs3.zip` (see the extraction sketch after this list).
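To unpack the bundled archives into the layout above, something like the following should work (a minimal sketch; the internal structure of the archives may differ, so adjust the target directories if needed):

```bash
# Extract the bundled corpora into the expected directories (see the tree above).
unzip data/RRG.zip -d data/gum_rs3/ru/          # RRG: Russian counterpart of GUM
unzip data/rurstb_rs3.zip -d data/rurstb_rs3/   # RRT (RuRSTB), one tree per document
```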
The train/dev/test splits for GUM/RRG are listed under `data/gum_file_lists` for GUM v9.1. If you are using a later or extended version, update these file lists accordingly.
Set `WANDB_KEY` in `dmrst_parser/keys.py` for online wandb support.
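For reference, a minimal `keys.py` could look like this (the variable name and file path come from the repository; the key value itself is a placeholder to replace with your own):

```python
# dmrst_parser/keys.py
# API key used for online Weights & Biases logging.
WANDB_KEY = "your-wandb-api-key"
```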
- Train:
  ```bash
  python dmrst_parser/multiple_runs.py --corpus "$CORPUS" --lang "$LANG" --model_type "$TYPE" --cuda_device 0 train
  ```
- Evaluate:
  ```bash
  python dmrst_parser/multiple_runs.py --corpus "$CORPUS" --lang "$LANG" --model_type "$TYPE" --cuda_device 0 evaluate
  ```
- Train (mixed):
  ```bash
  python dmrst_parser/multiple_runs.py --corpus 'GUM' --lang "$LANG" --model_type "$TYPE" train_mixed --mixed 100
  ```
- Evaluate (transfer):
  ```bash
  python utils/eval_dmrst_transfer.py --models_dir saves/path-with-models \
      --corpus 'GUM' --lang "$LANG2" --nfolds 5 evaluate
  ```
Parameter values:
- `LANG`: `en`, `ru`
- `CORPUS`: `RST-DT`, `GUM` (RRG with `lang=ru`), `RuRSTB` (RRT)
- `TYPE`: `default`, `+tony`, `+tony+bilstm_edus`
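For example, a concrete run might look like the following (the parameter choices are just one valid combination from the lists above, and `saves/path-with-models` stands for whatever directory your training run produced):

```bash
# Train and evaluate the +tony model on English GUM.
python dmrst_parser/multiple_runs.py --corpus 'GUM' --lang 'en' --model_type '+tony' --cuda_device 0 train
python dmrst_parser/multiple_runs.py --corpus 'GUM' --lang 'en' --model_type '+tony' --cuda_device 0 evaluate

# Train on mixed data, then check transfer to Russian (the RRG side of GUM).
python dmrst_parser/multiple_runs.py --corpus 'GUM' --lang 'en' --model_type '+tony' train_mixed --mixed 100
python utils/eval_dmrst_transfer.py --models_dir saves/path-with-models \
    --corpus 'GUM' --lang 'ru' --nfolds 5 evaluate
```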