# Setup and file locations on the Saarland servers
- `cd` to your home directory.
- Create a file `.condarc` with content

  ```
  envs_dirs:
  - /proj/irtg.shadow/conda/envs
  ```
- Further create a file `run_conda.sh` with content

  ```bash
  # set UTF encoding right
  export LC_ALL=en_US.UTF-8
  # run conda
  . /proj/contrib/linux/anaconda3/etc/profile.d/conda.sh
  # for comet_ml
  export https_proxy="http://www-proxy.uni-saarland.de:3128"
  ```
- Clone the repository. Note that the prediction scripts mentioned in the quick guide of the main readme (as well as the training scripts) download large model files (1.5 GB) and the large am-tools jar (0.5 GB). It is therefore recommended to clone this repository to a `/local/your_username/` directory and work there, rather than in your home directory.
- `cd` to your home directory and run

  ```bash
  . run_conda.sh
  conda activate allennlp
  ```

  This activates the environment at `/proj/irtg.shadow/conda/envs/allennlp`.
- `cd` to the cloned repository. You should now be good to run this parser.
- To leave the environment, use `conda deactivate`. A consolidated shell sketch of all these steps follows below.
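For convenience, here is a minimal consolidated sketch of the steps above. The repository URL and the `$USER`-based directory name are assumptions; adapt them to your situation.

```bash
# One-time setup, run from your home directory.
cd ~

# Tell conda where the shared environments live.
cat > .condarc <<'EOF'
envs_dirs:
- /proj/irtg.shadow/conda/envs
EOF

# Helper script: locale, conda initialization, proxy for comet_ml.
cat > run_conda.sh <<'EOF'
# set UTF encoding right
export LC_ALL=en_US.UTF-8
# run conda
. /proj/contrib/linux/anaconda3/etc/profile.d/conda.sh
# for comet_ml
export https_proxy="http://www-proxy.uni-saarland.de:3128"
EOF

# Clone to /local to keep large downloads out of your home directory.
# (Repository URL assumed here; substitute the URL of this repository.)
mkdir -p /local/$USER
git clone https://github.com/coli-saar/am-parser.git /local/$USER/am-parser

# Load conda, activate the shared environment, and enter the clone.
. ~/run_conda.sh
conda activate allennlp
cd /local/$USER/am-parser
```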
The server only has Java 8 installed. This works with the automatically downloaded `am-tools.jar`. However, if you want to use a self-compiled `am-tools.jar` (e.g. from the `new_decomposition` branch), this will not work, since am-tools now relies on Java 11. In that case you will need, for example, a Docker setup in which you can choose the Java version yourself (see the sketch below).
One thing to note is that am-tools has a branch called `java_8`, which can be compiled with Java 8. At the time of this writing (November 2023), this branch is up to date with the master branch of am-tools, except that it uses an older version of alto as a dependency.
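A minimal sketch of such a Docker setup, assuming the repository was cloned to `/local/$USER/am-parser` (the image name and mount point are choices made here, not part of this project's setup):

```bash
# Start a throwaway container with Java 11 and the repository mounted.
docker run --rm -it \
    -v /local/$USER/am-parser:/work -w /work \
    eclipse-temurin:11 bash
# Inside the container, `java -version` should report Java 11, so a
# self-compiled am-tools.jar can be run from /work.
```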
The quick guide of the main readme documents how to reproduce our parsing results. The prediction script requires as input the respective test set you want to parse (the `-i` option); these files exist on the Saarland servers, but you may need to be a member of the `irtg` group to access them. The file locations are:
| Formalism | Test set | Path |
|---|---|---|
| DM | in-domain test set | `/proj/irtg/sempardata/ACL2019/SemEval/2015/DM/test.id/en.id.dm.sdp` |
| DM | out-of-domain test set | `/proj/irtg/sempardata/ACL2019/SemEval/2015/DM/test.ood/en.ood.dm.sdp` |
| PAS | in-domain test set | `/proj/irtg/sempardata/ACL2019/SemEval/2015/PAS/test.id/en.id.pas.sdp` |
| PAS | out-of-domain test set | `/proj/irtg/sempardata/ACL2019/SemEval/2015/PAS/test.ood/en.ood.pas.sdp` |
| PSD | in-domain test set | `/proj/irtg/sempardata/ACL2019/SemEval/2015/PSD/test.id/en.id.psd.sdp` |
| PSD | out-of-domain test set | `/proj/irtg/sempardata/ACL2019/SemEval/2015/PSD/test.ood/en.ood.psd.sdp` |
| EDS | test set | `/proj/irtg/sempardata/ACL2019/EDS/original_data/test.amr` |
| AMR 2017 | test set | `/proj/irtg/amrtagging/amrtagging/corpora/abstract_meaning_representation_amr_2.0/data/amrs/split/test/` |
| AMR 2015 | test set* | `/proj/irtg/amrtagging/cleanAMRData_02-2019/AMRBank/LDC2015E86/test/` |
*Note that the current setup uses a model trained on AMR 2017, as well as lookup data from that corpus for postprocessing, so the results on this set are not comparable.
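For example, a prediction run on the DM in-domain test set would pass that path via `-i`. The script name below is an assumption; check the quick guide in the main readme for the exact invocation and its remaining options.

```bash
# Hypothetical invocation; only the -i option is documented on this page.
# See the quick guide in the main readme for the actual script name and
# the remaining options.
bash scripts/predict.sh -i /proj/irtg/sempardata/ACL2019/SemEval/2015/DM/test.id/en.id.dm.sdp
```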
You can find training and dev sets in similar locations; the one exception is EDS, for which there is apparently no dev set (use `data_preparation/EDS_split_train_dev.py` to create one).
To execute preprocessing, you need the original corpora. These files exist on the Saarland servers; again, you may need to be a member of the `irtg` group to access them. The file locations are:
| Formalism | Path |
|---|---|
| DM | `/proj/irtg/sempardata/sdp/2015/` |
| PAS | `/proj/irtg/sempardata/sdp/2015/` |
| PSD | `/proj/irtg/sempardata/sdp/2015/` |
| EDS | `/proj/irtg/amrtagging/SDP/EDS/raw_data/` |
| AMR 2017 | `/proj/corpora/abstract_meaning_representation_amr_2.0_LDC2017T10/abstract_meaning_representation_amr_2.0/data/amrs/split/` |
| AMR 2015 | `/proj/irtg/amrtagging/cleanAMRData_02-2019/AMRBank/LDC2015E86` |
If you just want to train a new model and skip preprocessing, you can find the preprocessed train and dev sets on the Saarland servers (membership in the `irtg` group may again be required). The `train.amconll` and `dev.amconll` files of each formalism can be found in the `train` and `dev` subdirectories of the following locations:
| Formalism | Path |
|---|---|
| DM | `/proj/irtg/sempardata/ACL2019/SemEval/2015/DM/` |
| PAS | `/proj/irtg/sempardata/ACL2019/SemEval/2015/PAS/` |
| PSD | `/proj/irtg/sempardata/ACL2019/SemEval/2015/PSD/` |
| EDS | `/proj/irtg/sempardata/ACL2019/EDS/` |
| AMR 2017 | `/proj/irtg/sempardata/ACL2019/AMR/2017/` |
| AMR 2015 | `/proj/irtg/sempardata/ACL2019/AMR/2015/` |
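For example, for DM the two files resolve to the following paths; the pattern is the same for the other formalisms.

```bash
# Preprocessed DM files: <base path>/train/train.amconll and
# <base path>/dev/dev.amconll.
ls /proj/irtg/sempardata/ACL2019/SemEval/2015/DM/train/train.amconll
ls /proj/irtg/sempardata/ACL2019/SemEval/2015/DM/dev/dev.amconll
```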
The MRP data is all contained in `/proj/irtg/sempardata/mrp`.

- The subdirectory `LDC2019E45` contains the original data as distributed by the organizers, the companion data and a few preprocessed files; see the `README.txt` in `LDC2019E45`.
- The subdirectory `eval` contains the test files plus companion data.
- The subdirectory `amconll` contains all the files you need to train the parser.
| Formalism | Name of decomposition | Training file | Comment |
|---|---|---|---|
| DM | - | `/proj/irtg/sempardata/mrp/amconll/DM/train/train.amconll` | |
| PSD | - | `/proj/irtg/sempardata/mrp/amconll/PSD/train/train.amconll` | |
| EDS | - | `/proj/irtg/sempardata/mrp/amconll/EDS/train/train.amconll` | |
| AMR | clean_decomp | `/proj/irtg/sempardata/mrp/amconll/AMR/clean_decomp/train/train.amconll` | data used in submitted version, no extensive WordNet use, no CoreNLP |
| AMR | after-mrp-wn-stanf | `/proj/irtg/sempardata/mrp/amconll/AMR/after-mrp-wn-stanf/train/train.amconll` | version used in paper, labeled as "improved + WordNet/Stanford" |
| UCCA | very_first | `/proj/irtg/sempardata/mrp/amconll/UCCA/very_first/train/train.amconll` | submitted version |
| UCCA | af_no_remote | `/proj/irtg/sempardata/mrp/amconll/UCCA/af_no_remote/train/train.amconll` | "improved version" in paper, no remote edges |
| UCCA | af_remote | `/proj/irtg/sempardata/mrp/amconll/UCCA/af_remote/train/train.amconll` | not submitted; same as af_no_remote but with remote edges kept |
All folders for a graphbank have a specific structure (sketched below):

- `train` contains the `train.amconll` file.
- `dev` contains an empty amconll file (only sentences, no AM dependency trees) and corresponds to the entire dev set. There is also the file `dev.mrp`, which contains the corresponding gold graphs.
- `gold-dev` contains the AM dependency trees for the subset of the dev set that our heuristics could decompose, along with a corresponding file of gold graphs.
- `test` contains an empty amconll file with the test sentences.
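Schematically, the layout of each graphbank folder looks roughly like this:

```
<graphbank>/
├── train/       # train.amconll with AM dependency trees
├── dev/         # empty amconll (sentences only) + dev.mrp with gold graphs
├── gold-dev/    # AM dep. trees for the decomposable dev subset + gold graphs
└── test/        # empty amconll with the test sentences
```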
In the case of UCCA, there is also another folder that contains the output of the (Python) preprocessing that is used as input to the decomposition.
- Groschwitz et al. 2018: The original files are now at `/local/jonasg/amrtagging/` on `falken-3`. @jgroschwitz also has a local backup copy.