Data Augmentation scripts for the parser MaChAmp-TWG as part of my bachelor thesis titled "Data Augmentation for TWG Parsing via Syntactically Well-formed Nonsense Sentences".
Deep learning dominates current research, and the field of computational linguistics plays a crucial role in developing capable parsing systems and in analysing natural languages, which constantly evolve. One of the central challenges in NLP (Natural Language Processing) is the amount of data needed for training. This is where data augmentation comes in: expanding training data artificially can, in theory, improve parsing models and eliminate the need to source entirely new data.
This thesis investigates different methods of data augmentation for trainable TWG (Tree Wrapping Grammar) parsers and evaluates their validity as a means of improving parsers. The parser used is MaChAmp-TWG, a parser similar to that of Bladier et al. (2022) from the TreeGraSP project (Evang et al., 2021), trained on RRGparbank gold data in English (Bladier et al., 2022).
The methods used are mainly based on random word replacements; the idea is to expand the training dataset in order to test whether syntactic validity alone is enough to obtain a larger training dataset and yield higher parsing accuracy.
The conclusion of this thesis is that all the presented methods produced parsing models with an increased F-score over the base model, the best-performing model being a combination of both augmentation types (unchanged and nonsense sentences). The results suggest that semantic information plays a somewhat important role in parsing: omitting semantics yields worse performance compared to the models without nonsense-sentence augmentations. However, the models with nonsense-sentence augmentations still scored higher than the base MaChAmp-TWG model.
Data augmentation for parsers appears to be a challenging way to improve performance. Although the tested methods and implementations delivered higher F-scores across the board, with no fine-tuning or only minor changes to the training configuration, the data they produced was not flawless: all methods had inherent issues and generated some invalid (augmented) sentences.
Converts the original UniMorph file into the format of the CoNLLU file.
Looks up all verbs from the UniMorph file successively on dictionary.com and categorizes them as 'transitive' or 'intransitive' (very slow; better to use a dictionary API if available).
Checks and adds the transitivity of all verbs to the RRG CoNLLU file.
Filters out all unused words / lines in the RRG CoNLLU file.
Replaces words in the RRG CoNLLU file with randomly chosen ones from the UniMorph file.
Replaces original words in the training file with random ones generated by the module 'generate.py', then augments the training file with the new sentences.
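The word-replacement idea behind the scripts can be sketched as follows. This is a minimal illustration, not the actual implementation: the `POOL` dictionary and the function name are invented for the example, and the real scripts draw their replacements from the UniMorph file or from the training data itself.

```python
import random

# Minimal sketch: replace the FORM column of CoNLL-U token lines with a
# random word of the same word class, drawn from a hypothetical pool
# keyed by UPOS. Comment lines and sentence breaks pass through unchanged.
POOL = {
    "NOUN": ["table", "river", "idea"],
    "VERB": ["runs", "eats", "sleeps"],
}

def replace_words(conllu_lines, rng=random):
    out = []
    for line in conllu_lines:
        if line.startswith("#") or not line.strip():
            out.append(line)
            continue
        cols = line.split("\t")
        upos = cols[3]  # UPOS is the 4th CoNLL-U column
        if upos in POOL:
            cols[1] = rng.choice(POOL[upos])  # FORM is the 2nd column
        out.append("\t".join(cols))
    return out
```

Because only the FORM column changes, the tree structure, dependency relations, and supertags of the original sentence are preserved, which is what keeps the augmented sentences syntactically well-formed.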
- Python 3.6 or newer
- modules from the requirements.txt file
- install them with: pip install -r requirements.txt
- Place all files and directories into the main directory of MaChAmp-TWG, or use the scripts standalone without MaChAmp-TWG. The relevant files and directories are already included.
--unimorph0: UniMorph verb replacements that are inaccurate with regard to transitivity. Use in place of --unimorph1, --internal, --supertag, or --original.
--unimorph1: UniMorph verb replacements that are accurate with regard to transitivity. Use in place of --unimorph0, --internal, --supertag, or --original.
--internal: Internal word replacements. Use in place of --unimorph0, --unimorph1, --supertag, or --original.
--supertag: Internal supertag word replacements. Use in place of --unimorph0, --unimorph1, --internal, or --original.
--original: Augmentation with unchanged original sentences. Use in place of --unimorph0, --unimorph1, --internal, or --supertag.
-i RRGINPUT: (OPTIONAL) Filtered RRG file input. Default file: "rrgparbank/conllu/filtered_acc_en_conllu.conllu".
-o RRGOUTPUT: (OPTIONAL) Filtered RRG file output directory. Default directory: "rrgparbank/conllu".
-t TAG: Word tag that selects which word class to replace (see the list of available tags below).
-ti TRAININPUT: (OPTIONAL) train.supertags file input. Default file: "experiments/rrgparbank-en/base/train.supertags".
-to TRAINOUTPUT: (OPTIONAL) train.supertags file output directory. If no directory is specified, the default directory is used and the filename changes to "new_train.supertags".
-s EXTENSIONSIZE: Extension size of the resulting training file. Must be >= 2. A value of 2 doubles the number of sentences in the base training file; in general, the script makes (EXTENSIONSIZE - 1) augmentation passes through the file.
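The relationship between the extension size and the size of the output file can be sketched as below. The function is illustrative only (not part of the scripts) and assumes the behaviour stated in the option description:

```python
# Sketch of the assumed relationship between -s/--extensionSize and the
# number of sentences in the resulting training file.
def augmented_sentence_count(base_sentences: int, extension_size: int) -> int:
    if extension_size < 2:
        raise ValueError("extension size must be >= 2")
    # one original copy plus (extension_size - 1) augmentation passes,
    # each adding another base_sentences worth of sentences
    return base_sentences * extension_size
```

For example, a 1000-sentence base training file with -s 2 would yield 2000 sentences.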
nS: Noun Singular
nP: Noun Plural
aPoss: Adjective Possessive
aCmpr: Adjective Comparative
aSup: Adjective Superlative
vPst: Verb Past Tense
vPresPart: Verb Present Tense, Participle Form
vPstPart: Verb Past Tense, Participle Form
adv (for --internal only): Adverb
advInt (for --internal only): Adverb, Pronominal type: Interrogative
advSup (for --internal only): Adverb Superlative
advCmpr (for --internal only): Adverb Comparative
noun: All nouns
adj: All adjectives
verb: All verbs
all: All available tags
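One plausible way to read these tags is as filters over CoNLL-U tokens, as sketched below. The mapping uses standard Universal Dependencies UPOS and feature names and is a hypothetical illustration; the actual scripts may encode the tags differently.

```python
# Hypothetical mapping from a few of the tag names above to
# (UPOS, FEATS-substring) filters on CoNLL-U tokens.
TAG_FILTERS = {
    "nS":    ("NOUN", "Number=Sing"),
    "nP":    ("NOUN", "Number=Plur"),
    "aCmpr": ("ADJ",  "Degree=Cmp"),
    "aSup":  ("ADJ",  "Degree=Sup"),
    "vPst":  ("VERB", "Tense=Past"),
}

def token_matches(tag: str, upos: str, feats: str) -> bool:
    """Return True if a token's UPOS and FEATS match the given tag."""
    want_upos, want_feat = TAG_FILTERS[tag]
    return upos == want_upos and want_feat in feats
```

Under this reading, the collective tags (noun, adj, verb, all) would simply union the matching individual filters.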
augment.py [-h] [--unimorph0] [--unimorph1] [--internal] [--supertag] [--original]
[-i RRGINPUT] [-o RRGOUTPUT] [-t TAG] [-ti TRAININPUT] [-to TRAINOUTPUT] -s EXTENSIONSIZE
Example 1:
python augment.py --unimorph0 --tag all --extensionSize 2
or
python augment.py --unimorph0 -t all -s 2
Example 2:
python augment.py --supertag --extensionSize 10
or
python augment.py --supertag -s 10
Tatiana Bladier, Kilian Evang, Valeria Generalova, Zahra Ghane, Laura Kallmeyer, Robin Möllemann, Natalia Moors, Rainer Osswald, and Simon Petitjean. 2022. RRGparbank: A Parallel Role and Reference Grammar Treebank. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4833–4841, Marseille, France. European Language Resources Association.
Kilian Evang, Tatiana Bladier, Laura Kallmeyer, and Simon Petitjean. 2021. Bootstrapping Role and Reference Grammar Treebanks via Universal Dependencies. In Proceedings of the Fifth Workshop on Universal Dependencies (UDW, SyntaxFest 2021), pages 30–48, Sofia, Bulgaria. Association for Computational Linguistics.
Tatiana Bladier, Jakub Waszczuk, and Laura Kallmeyer. 2020. Statistical Parsing of Tree Wrapping Grammars. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6759–6766, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Laura Kallmeyer, Rainer Osswald, and Robert D. Van Valin, Jr. 2013. Tree Wrapping for Role and Reference Grammar. In Glyn Morrill and Mark-Jan Nederhof, editors, Formal Grammar (FG 2012/2013), Lecture Notes in Computer Science, vol. 8036. Springer, Berlin, Heidelberg.