# Setup and file locations on the Saarland servers
- `cd` to your home directory.
- Create a file `.condarc` with content

  ```
  envs_dirs:
  - /proj/irtg.shadow/conda/envs
  ```
- Further create a file `run_conda.sh` with content

  ```bash
  # set UTF encoding right
  export LC_ALL=en_US.UTF-8
  # run conda
  . /proj/contrib/linux/anaconda3/etc/profile.d/conda.sh
  # for comet_ml
  export https_proxy="http://www-proxy.uni-saarland.de:3128"
  ```
- Clone the repository. Note that the prediction scripts mentioned in the quick guide of the main readme (as well as the training scripts) download large model files (1.5 GB) and the large am-tools jar (0.5 GB). It is therefore recommended to clone this repository to a `/local/your_username/` directory and work there, rather than in your home directory.
- `cd` to your home directory and run

  ```bash
  . run_conda.sh
  conda activate allennlp
  ```

  This activates the environment at `/proj/irtg.shadow/conda/envs/allennlp`.
- `cd` to the cloned repository. You should now be good to run this parser.
- To leave the environment, use `conda deactivate`. A consolidated shell sketch of all these steps follows below.
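For convenience, here is a minimal consolidated sketch of the steps above. The repository URL and the `$USER`-based directory name are assumptions; adapt them to your situation.

```bash
# One-time setup, run from your home directory.
cd ~

# Tell conda where the shared environments live.
cat > .condarc <<'EOF'
envs_dirs:
- /proj/irtg.shadow/conda/envs
EOF

# Helper script: locale, conda initialization, proxy for comet_ml.
cat > run_conda.sh <<'EOF'
# set UTF encoding right
export LC_ALL=en_US.UTF-8
# run conda
. /proj/contrib/linux/anaconda3/etc/profile.d/conda.sh
# for comet_ml
export https_proxy="http://www-proxy.uni-saarland.de:3128"
EOF

# Clone to /local to keep large downloads out of your home directory.
# (Repository URL assumed here; substitute the URL of this repository.)
mkdir -p /local/$USER
git clone https://github.com/coli-saar/am-parser.git /local/$USER/am-parser

# Load conda, activate the shared environment, and enter the clone.
. ~/run_conda.sh
conda activate allennlp
cd /local/$USER/am-parser
```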
The server only has Java 8 installed. This works with the automatically downloaded `am-tools.jar`. However, if you want to use a self-compiled `am-tools.jar` (e.g. from the `new_decomposition` branch), this will not work, since am-tools now relies on Java 11. In that case you will need, for example, a Docker setup in which you can choose the Java version yourself (see the sketch below).
One thing to note is that am-tools has a branch called `java_8`, which can be compiled with Java 8. At the time of this writing (November 2023), this branch is up to date with the master branch of am-tools, except that it uses an older version of alto as a dependency.
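A minimal sketch of such a Docker setup, assuming the repository was cloned to `/local/$USER/am-parser` (the image name and mount point are choices made here, not part of this project's setup):

```bash
# Start a throwaway container with Java 11 and the repository mounted.
docker run --rm -it \
    -v /local/$USER/am-parser:/work -w /work \
    eclipse-temurin:11 bash
# Inside the container, `java -version` should report Java 11, so a
# self-compiled am-tools.jar can be run from /work.
```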
The quick guide of the main readme documents how to reproduce our parsing results. The prediction script requires as input the respective test set you want to parse (the `-i` option); these files exist on the Saarland servers, but you may need to be a member of the `irtg` group to access them. The file locations are:
| Formalism | Test set | Path |
|---|---|---|
| DM | in-domain test set | `/proj/irtg/sempardata/ACL2019/SemEval/2015/DM/test.id/en.id.dm.sdp` |
| DM | out-of-domain test set | `/proj/irtg/sempardata/ACL2019/SemEval/2015/DM/test.ood/en.ood.dm.sdp` |
| PAS | in-domain test set | `/proj/irtg/sempardata/ACL2019/SemEval/2015/PAS/test.id/en.id.pas.sdp` |
| PAS | out-of-domain test set | `/proj/irtg/sempardata/ACL2019/SemEval/2015/PAS/test.ood/en.ood.pas.sdp` |
| PSD | in-domain test set | `/proj/irtg/sempardata/ACL2019/SemEval/2015/PSD/test.id/en.id.psd.sdp` |
| PSD | out-of-domain test set | `/proj/irtg/sempardata/ACL2019/SemEval/2015/PSD/test.ood/en.ood.psd.sdp` |
| EDS | test set | `/proj/irtg/sempardata/ACL2019/EDS/original_data/test.amr` |
| AMR 2017 | test set | `/proj/irtg/amrtagging/amrtagging/corpora/abstract_meaning_representation_amr_2.0/data/amrs/split/test/` |
| AMR 2015 | test set* | `/proj/irtg/amrtagging/cleanAMRData_02-2019/AMRBank/LDC2015E86/test/` |
*Note that the current setup uses a model trained on AMR 2017, as well as lookup data from that corpus for postprocessing, so the results on this set are not comparable.
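For example, a prediction run on the DM in-domain test set would pass that path via `-i`. The script name below is an assumption; check the quick guide in the main readme for the exact invocation and its remaining options.

```bash
# Hypothetical invocation; only the -i option is documented on this page.
# See the quick guide in the main readme for the actual script name and
# the remaining options.
bash scripts/predict.sh -i /proj/irtg/sempardata/ACL2019/SemEval/2015/DM/test.id/en.id.dm.sdp
```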
You can find training and dev sets in similar locations; the one exception is EDS, for which there is apparently no dev set (use `data_preparation/EDS_split_train_dev.py` to create one).
To execute preprocessing, you need the original corpora. These files exist on the Saarland servers; again, you may need to be a member of the `irtg` group to access them. The file locations are:
| Formalism | Path |
|---|---|
| DM | `/proj/irtg/sempardata/sdp/2015/` |
| PAS | `/proj/irtg/sempardata/sdp/2015/` |
| PSD | `/proj/irtg/sempardata/sdp/2015/` |
| EDS | `/proj/irtg/amrtagging/SDP/EDS/raw_data/` |
| AMR 2017 | `/proj/corpora/abstract_meaning_representation_amr_2.0_LDC2017T10/abstract_meaning_representation_amr_2.0/data/amrs/split/` |
| AMR 2015 | `/proj/irtg/amrtagging/cleanAMRData_02-2019/AMRBank/LDC2015E86` |
If you just want to train a new model and skip preprocessing, you can find the preprocessed train and dev sets on the Saarland servers (membership in the `irtg` group may again be required). The `train.amconll` and `dev.amconll` files of each formalism can be found in the `train` and `dev` subdirectories of the following locations:
| Formalism | Path |
|---|---|
| DM | `/proj/irtg/sempardata/ACL2019/SemEval/2015/DM/` |
| PAS | `/proj/irtg/sempardata/ACL2019/SemEval/2015/PAS/` |
| PSD | `/proj/irtg/sempardata/ACL2019/SemEval/2015/PSD/` |
| EDS | `/proj/irtg/sempardata/ACL2019/EDS/` |
| AMR 2017 | `/proj/irtg/sempardata/ACL2019/AMR/2017/` |
| AMR 2015 | `/proj/irtg/sempardata/ACL2019/AMR/2015/` |
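For example, for DM the two files resolve to the following paths; the pattern is the same for the other formalisms.

```bash
# Preprocessed DM files: <base path>/train/train.amconll and
# <base path>/dev/dev.amconll.
ls /proj/irtg/sempardata/ACL2019/SemEval/2015/DM/train/train.amconll
ls /proj/irtg/sempardata/ACL2019/SemEval/2015/DM/dev/dev.amconll
```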
The MRP data is all contained in `/proj/irtg/sempardata/mrp`.

- The subdirectory `LDC2019E45` contains the original data as distributed by the organizers, the companion data and a few preprocessed files; see the `README.txt` in `LDC2019E45`.
- The subdirectory `eval` contains the test files plus companion data.
- The subdirectory `amconll` contains all the files you need to train the parser.
| Formalism | Name of decomposition | Training file | Comment |
|---|---|---|---|
| DM | - | `/proj/irtg/sempardata/mrp/amconll/DM/train/train.amconll` | |
| PSD | - | `/proj/irtg/sempardata/mrp/amconll/PSD/train/train.amconll` | |
| EDS | - | `/proj/irtg/sempardata/mrp/amconll/EDS/train/train.amconll` | |
| AMR | clean_decomp | `/proj/irtg/sempardata/mrp/amconll/AMR/clean_decomp/train/train.amconll` | data used in submitted version, no extensive WordNet use, no CoreNLP |
| AMR | after-mrp-wn-stanf | `/proj/irtg/sempardata/mrp/amconll/AMR/after-mrp-wn-stanf/train/train.amconll` | version used in paper, labeled as "improved + WordNet/Stanford" |
| UCCA | very_first | `/proj/irtg/sempardata/mrp/amconll/UCCA/very_first/train/train.amconll` | submitted version |
| UCCA | af_no_remote | `/proj/irtg/sempardata/mrp/amconll/UCCA/af_no_remote/train/train.amconll` | "improved version" in paper, no remote edges |
| UCCA | af_remote | `/proj/irtg/sempardata/mrp/amconll/UCCA/af_remote/train/train.amconll` | not submitted; same as af_no_remote but with remote edges kept |
All folders for a graphbank have a specific structure (sketched below):

- `train` contains the `train.amconll` file.
- `dev` contains an empty amconll file (only sentences, no AM dependency trees) and corresponds to the entire dev set. There is also the file `dev.mrp`, which contains the corresponding gold graphs.
- `gold-dev` contains the AM dependency trees for the subset of the dev set that our heuristics could decompose, along with a corresponding file of gold graphs.
- `test` contains an empty amconll file with the test sentences.
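Schematically, the layout of each graphbank folder looks roughly like this:

```
<graphbank>/
├── train/       # train.amconll with AM dependency trees
├── dev/         # empty amconll (sentences only) + dev.mrp with gold graphs
├── gold-dev/    # AM dep. trees for the decomposable dev subset + gold graphs
└── test/        # empty amconll with the test sentences
```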
In the case of UCCA, there is also another folder that contains the output of the (Python) preprocessing that is used as input to the decomposition.
- Groschwitz et al. 2018: The original files are now at `/local/jonasg/amrtagging/` on `falken-3`. @jgroschwitz also has a local backup copy.