m4cit/MaChAmp-TWG-Data-Augmentation

About

Data Augmentation scripts for the parser MaChAmp-TWG as part of my bachelor thesis titled "Data Augmentation for TWG Parsing via Syntactically Well-formed Nonsense Sentences".

Abstract of my Thesis

The future seems to be heavily focused on deep learning, and the field of computational linguistics plays a crucial role in developing capable parsing systems and analysing natural languages, which constantly evolve and change. One of the challenges with NLP (Natural Language Processing) is the amount of data needed for training. This is where data augmentation comes in. Expanding training data artificially can theoretically improve parsing models, and eliminate the need to source entirely new data.

This thesis investigates different methods of data augmentation for trainable TWG (Tree-Wrapping Grammar) parsers and evaluates their validity as a means of improving parsers. The parser used is MaChAmp-TWG, a parser similar to that of Bladier et al. (2022) from the TreeGraSP project (Evang et al., 2021), with RRGparbank gold data in English (Bladier et al., 2022).
The methods used are mainly based on random word replacements; the idea is to expand the training dataset artificially and to test whether syntactic validity alone is enough to yield higher parsing accuracy.

The conclusion of this thesis is that all the presented methods produced parsing models with an increased F-score over the base model, the best-performing model being a combination of both augmentation types (unchanged and nonsense sentences). The results suggest that semantic information plays a somewhat important role in parsing, and that the omission of semantics yields worse performance compared to the models without nonsense sentence augmentations. However, the models with nonsense sentence augmentations still scored higher than the base model of MaChAmp-TWG.
Data augmentation for parsers nevertheless seems to be a challenging way to improve performance. Although the tested methods and implementations delivered higher F-scores across the board, without any fine-tuning or with only a small change to the training configuration, the augmented data was not flawless: all of the methods had inherent issues and produced some invalid (augmented) data.

Scripts

1_unimorph_to_conllu.py

Translates the original UniMorph file into the format of the CoNLLU file.
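
For illustration, a minimal sketch of this kind of conversion, assuming the standard three-column UniMorph TSV layout (lemma, inflected form, feature bundle) and a simple column mapping; the file names and the mapping are assumptions, not the script's actual implementation:

def unimorph_to_conllu_line(tsv_line, token_id=1):
    # UniMorph lines are tab-separated: lemma, inflected form, feature bundle
    lemma, form, features = tsv_line.rstrip("\n").split("\t")
    parts = features.split(";")
    upos = parts[0]                               # e.g. "N", "V", "ADJ" in UniMorph
    feats = ";".join(parts[1:]) if len(parts) > 1 else "_"
    # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    return "\t".join([str(token_id), form, lemma, upos, "_", feats, "_", "_", "_", "_"])

with open("eng.unimorph", encoding="utf-8") as src, open("unimorph.conllu", "w", encoding="utf-8") as dst:
    for line in src:
        if line.strip():
            dst.write(unimorph_to_conllu_line(line) + "\n")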

2_improveUnimorph.py

Looks up all verbs from the UniMorph file one by one on dictionary.com and categorizes them as 'transitive' or 'intransitive'. (Very slow; better to use a dictionary API if available.)
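
A rough sketch of such a lookup, assuming dictionary.com marks transitive senses as "verb (used with object)" and intransitive ones as "verb (used without object)"; the URL pattern and these labels are assumptions about the site's markup, not the script's actual code:

import time
import requests

def verb_transitivity(verb):
    # Assumed URL pattern; the site's layout may change at any time
    resp = requests.get("https://www.dictionary.com/browse/" + verb, timeout=10)
    if resp.status_code != 200:
        return "unknown"
    html = resp.text.lower()
    transitive = "verb (used with object)" in html
    intransitive = "verb (used without object)" in html
    if transitive and intransitive:
        return "both"
    if transitive:
        return "transitive"
    if intransitive:
        return "intransitive"
    return "unknown"

for verb in ["devour", "sleep"]:
    print(verb, verb_transitivity(verb))
    time.sleep(1)  # throttling requests is one reason this step is slow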

3_improveRRG.py

Checks and adds the transitivity of all verbs to the RRG CoNLLU file.
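
As a sketch, the transitivity label could be written into the MISC column of the CoNLL-U file, assuming UD-style "VERB" part-of-speech tags; storing it as "Transitivity=..." is an assumption, not necessarily the format the script uses:

def add_transitivity(conllu_in, conllu_out, lexicon):
    # lexicon maps a verb lemma to "transitive", "intransitive", or "both"
    with open(conllu_in, encoding="utf-8") as src, open(conllu_out, "w", encoding="utf-8") as dst:
        for line in src:
            cols = line.rstrip("\n").split("\t")
            if len(cols) == 10 and cols[3] == "VERB" and cols[2] in lexicon:
                misc = [] if cols[9] in ("_", "") else cols[9].split("|")
                misc.append("Transitivity=" + lexicon[cols[2]])
                cols[9] = "|".join(misc)
                dst.write("\t".join(cols) + "\n")
            else:
                dst.write(line)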

4_filterForTrain.py

Filters out all unused words / lines in the RRG CoNLLU file.
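
One possible reading of this step, keeping only token lines whose word form actually occurs in the training data; the exact filtering criterion is an assumption:

def load_train_forms(train_path):
    # Assumes one token per line with the word form in the first column
    forms = set()
    with open(train_path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if cols and cols[0]:
                forms.add(cols[0].lower())
    return forms

def filter_conllu(conllu_in, conllu_out, forms):
    with open(conllu_in, encoding="utf-8") as src, open(conllu_out, "w", encoding="utf-8") as dst:
        for line in src:
            cols = line.rstrip("\n").split("\t")
            if len(cols) == 10 and cols[1].lower() not in forms:
                continue  # unused word: drop the line
            dst.write(line)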

generate.py

Replaces words in the RRG CoNLLU file with randomly chosen ones from the UniMorph file.
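
The replacement could work roughly as below: group candidate word forms from the UniMorph-derived file by part of speech and features, then draw one uniformly at random. The grouping key and sampling strategy are assumptions, not the module's actual API:

import random
from collections import defaultdict

def build_pool(unimorph_conllu):
    # Candidate word forms grouped by (UPOS, FEATS)
    pool = defaultdict(list)
    with open(unimorph_conllu, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) == 10:
                pool[(cols[3], cols[5])].append(cols[1])
    return pool

def random_replacement(upos, feats, pool, fallback):
    candidates = pool.get((upos, feats))
    return random.choice(candidates) if candidates else fallback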

augment.py

Replaces original words in the training file with random ones provided by the 'generate.py' module, then extends the training file with the newly generated sentences.
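
Conceptually, the augmentation loop appends extensionSize - 1 randomized copies of the base training file. The train.supertags layout assumed below (one token per line, blank lines between sentences) and the replace_token hook are assumptions, not the script's real interface:

import generate  # the repository's replacement module

def augment(train_path, out_path, extension_size):
    with open(train_path, encoding="utf-8") as f:
        base_lines = f.readlines()
    with open(out_path, "w", encoding="utf-8") as out:
        out.writelines(base_lines)              # keep the original sentences
        for _ in range(extension_size - 1):     # one pass per extra copy
            for line in base_lines:
                # replace_token is a hypothetical hook standing in for generate.py's logic
                out.write(generate.replace_token(line) if line.strip() else line)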

Requirements

  • Python 3.6 or newer
  • modules from the requirements.txt file

Installation

  1. pip install -r requirements.txt
    
  2. Place all files and directories in the main directory of MaChAmp-TWG, or use the scripts on their own without MaChAmp-TWG. The relevant files and directories are already included.

Input Parameters

Word Replacement Options

--unimorph0

UniMorph verb replacements that do not respect transitivity (inaccurate). Use in place of --unimorph1, --internal, --supertag, or --original.

--unimorph1

UniMorph verb replacements that respect transitivity (accurate). Use in place of --unimorph0, --internal, --supertag, or --original.

--internal

Internal word replacements. Use in place of --unimorph0, --unimorph1, --supertag, or --original.

--supertag

Internal supertag word replacements. Use in place of --unimorph0, --unimorph1, --internal, or --original.

--original

Augmentation with unchanged original sentences. Use in place of --unimorph0, --unimorph1, --internal, or --supertag.

General Options

-h, --help

Show the help message and exit.

-i, --RRGinput

(OPTIONAL) Filtered RRG file input. Default file: "rrgparbank/conllu/filtered_acc_en_conllu.conllu".

-o, --RRGoutput

(OPTIONAL) Filtered RRG file output directory. Default directory: "rrgparbank/conllu".

-t, --tag

Word tag(s) to replace; see the list of available tags below.

-ti, --trainInput

(OPTIONAL) train.supertags file input. Default file: "experiments/rrgparbank-en/base/train.supertags".

-to, --trainOutput

(OPTIONAL) train.supertags file output directory. If no directory is specified, the default directory is used and the output filename changes to "new_train.supertags".

-s, --extensionSize

Extension size of the resulting training file. Must be >= 2. A value of 2 doubles the number of sentences in the base training file, i.e. one augmentation pass over the file (the number of passes is extensionSize - 1); a value of 3 triples it with two passes, and so on.

Available tags (--tag) for the replacement task (not applicable to --supertag)

nS: Noun Singular
nP: Noun Plural

aPoss: Adjective Possessive
aCmpr: Adjective Comparative
aSup: Adjective Superlative

vPst: Verb Past Tense
vPresPart: Verb Present Tense, Participle Form
vPstPart: Verb Past Tense, Participle Form

adv (for --internal only): Adverb
advInt (for --internal only): Adverb, Pronominal type: Interrogative
advSup (for --internal only): Adverb Superlative
advCmpr (for --internal only): Adverb Comparative

noun: All nouns
adj: All adjectives
verb: All verbs
all: All available tags
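
For orientation, the tag keys above presumably correspond to part-of-speech and morphological-feature checks when replacement candidates are selected. The mapping below is purely illustrative and may differ from what the scripts actually use:

TAG_CRITERIA = {
    "nS":        ("NOUN", "Number=Sing"),
    "nP":        ("NOUN", "Number=Plur"),
    "aPoss":     ("ADJ",  "Poss=Yes"),
    "aCmpr":     ("ADJ",  "Degree=Cmp"),
    "aSup":      ("ADJ",  "Degree=Sup"),
    "vPst":      ("VERB", "Tense=Past"),
    "vPresPart": ("VERB", "Tense=Pres|VerbForm=Part"),
    "vPstPart":  ("VERB", "Tense=Past|VerbForm=Part"),
    "adv":       ("ADV",  ""),
    "advInt":    ("ADV",  "PronType=Int"),
    "advSup":    ("ADV",  "Degree=Sup"),
    "advCmpr":   ("ADV",  "Degree=Cmp"),
}
# "noun", "adj", "verb", and "all" would simply union the corresponding entries above.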

Usage

augment.py [-h] [--unimorph0] [--unimorph1] [--internal] [--supertag] [--original]
[-i RRGINPUT] [-o RRGOUTPUT] [-t TAG] [-ti TRAININPUT] [-to TRAINOUTPUT] -s EXTENSIONSIZE

Example 1:

python augment.py --unimorph0 --tag all --extensionSize 2

or

python augment.py --unimorph0 -t all -s 2



Example 2:

python augment.py --supertag --extensionSize 10

or

python augment.py --supertag -s 10

Sources

Tatiana Bladier, Kilian Evang, Valeria Generalova, Zahra Ghane, Laura Kallmeyer, Robin Möllemann, Natalia Moors, Rainer Osswald, and Simon Petitjean. 2022. RRGparbank: A Parallel Role and Reference Grammar Treebank. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4833–4841, Marseille, France. European Language Resources Association.

Kilian Evang, Tatiana Bladier, Laura Kallmeyer, and Simon Petitjean. 2021. Bootstrapping Role and Reference Grammar Treebanks via Universal Dependencies. In Proceedings of the Fifth Workshop on Universal Dependencies (UDW, SyntaxFest 2021), pages 30–48, Sofia, Bulgaria. Association for Computational Linguistics.

Tatiana Bladier, Jakub Waszczuk, and Laura Kallmeyer. 2020. Statistical Parsing of Tree Wrapping Grammars. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6759–6766, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Laura Kallmeyer, Rainer Osswald, and Robert D. Van Valin Jr. 2013. Tree Wrapping for Role and Reference Grammar. In Formal Grammar (FG 2012/2013), Lecture Notes in Computer Science, vol. 8036. Springer, Berlin, Heidelberg.

UniMorph
