This repository contains the code and the data from the paper: Empirical Error Modeling Improves Robustness of Noisy Neural Sequence Labeling that was accepted to appear in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. This work extends the Noise-Aware Training (NAT) framework.
NAT is a method that utilizes both the original and the perturbed input for training of the sequence labeling models. It improves accuracy on the data from noisy sources such as user-generated text or text produced by the Optical Character Recognition (OCR) process.
At training time, the original NAT method uses a vanilla synthetic error model to induce the noise into the error-free sentences. Although utilizing randomized error patterns during training often provided moderate improvements, we argue that the empirical error modeling would be significantly more beneficial. Our extension implements a data-driven error generator based on the sequence-to-sequence modeling paradigm. Our models were trained to solve the monotone string translation problem, akin to the error correction setting but in the opposite direction, i.e., we used an error-free sentence as input and trained the models to produce an erroneous text. Please refer to our paper for a more detailed explanation of our approach.
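To make the contrast concrete, a randomized ("vanilla") error model can be sketched as below. This is only an illustration of the general idea and not the implementation used by NAT: the function name `add_random_noise`, the `error_rate` parameter, and the restriction to lowercase ASCII replacements are all assumptions made for this example.

```python
import random
import string


def add_random_noise(sentence, error_rate=0.1, rng=random):
    """Randomly substitute, insert, or delete characters (spaces are kept)."""
    alphabet = string.ascii_lowercase  # assumed replacement alphabet
    noisy = []
    for char in sentence:
        if char != " " and rng.random() < error_rate:
            action = rng.choice(("substitute", "insert", "delete"))
            if action == "substitute":
                noisy.append(rng.choice(alphabet))
            elif action == "insert":
                noisy.append(char)
                noisy.append(rng.choice(alphabet))
            # "delete": the character is dropped, so nothing is appended
        else:
            noisy.append(char)
    return "".join(noisy)


# Error patterns produced this way are uniform and ignore which confusions
# an OCR engine or a human typist would actually make.
print(add_random_noise("error-free input sentence", error_rate=0.2, rng=random.Random(13)))
```

A data-driven generator replaces the uniform choices above with a model learned from pairs of clean and erroneous text.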
Moreover, to better imitate the real-world application scenario, we generated a set of noisy data sets by applying an OCR engine to the sentences extracted from the original sequence labeling benchmarks. Our code can be readily used to produce noisy versions of other sequence labeling data sets. We encourage the user to experiment with this functionality. We hope that our work will facilitate future research on robustness in Natural Language Processing.
The structure of the project resembles that of the original NAT framework with some differences (marked as [new] in the diagram below):
├── flair [extern]
├── flair_ext
│ ├── models
│ ├── trainers
│ └── visual
├── natas [extern]
├── onmt [extern]
├── pysia [new]
├── resources
│ ├── cmx
│ ├── conversion [new]
│ ├── corpora [new]
│ ├── dictionaries [new]
│ ├── fonts [new]
│ ├── language_models [new]
│ ├── taggers
│ ├── tasks
│ └── typos
├── results
├── robust_ner
├── scripts [new]
└── trdg [extern]
natas: contains the NATAS library for OCR post-correction that we used as a baseline for comparison with our method.
onmt: includes the ONMT toolkit that we utilized to train our sequence-to-sequence error generators and the error correction model employed by NATAS.
pysia: contains the code of our Python Sentence Inter-Alignment (PySIA) toolkit. It constitutes the core part of our contribution and contains methods for sentence alignment, training data preparation, and wrapper methods for the sequence-to-sequence training.
scripts: includes a slightly modified version of the punctuation normalization script from the 1 Billion Word Language Model Benchmark that we used in our experiments.
trdg: contains the Text Recognition Data Generator (TRDG) toolkit employed for text rendering.
Moreover, we extended the basic NAT framework by implementing our error generation methods and including them in the extended sequence labeling model. Additionally, we modified the trainer class and extended the former NAT functionality contained in robust_ner.
Furthermore, we added data to the resources directory, including dictionaries extracted from the test sets of the sequence labeling benchmarks that were used by the error correction methods, fonts that were utilized by the text rendering module, and edit operations and checksums required to recreate and validate the noisy sequence labeling data sets used in our experiments (cf. conversion).
Note that FLAIR, NATAS, ONMT and TRDG are not included in this repository. See the Quick Start section for more information about the installation of the additional dependencies.
Please refer to the description of the remaining components here.
- Please install the python packages as shown below:
pip install -r requirements.txt
- To use Hunspell, you need to install the required packages:
sudo apt-get install hunspell hunspell-en-us libhunspell-dev python-dev
pip install hunspell
To use the OCR functionality, please refer to the requirements of the tesserocr package.
If you experience problems installing Matplotlib, make sure that the FreeType library has been properly installed:
sudo apt-get install python-dev libfreetype6-dev
If you get the following error at the end of the sequence-to-sequence model training: ModuleNotFoundError: No module named 'tkinter'
then run the following command:
sudo apt-get install python3-tk
Please download the following projects and move them to the working directory as described below:
- Download the FLAIR framework (v0.5) from here, unzip it, rename the flair-0.5 directory to flair, and move it to the working directory.
- Download the NATAS library (v1.0.5) from here, unzip it, rename it to natas, and move it to the working directory.
- Download the ONMT toolkit (v1.1.1) from here, unzip it, rename it to onmt, and move it to the working directory.
- Download the TRDG toolkit (v1.6.0) from here, unzip it, rename it to trdg, and move it to the working directory.
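After completing the steps above, a quick check such as the following sketch can confirm that the four external directories are in place. The helper `missing_dependencies` is hypothetical and not part of this repository; only the directory names are taken from the list above.

```python
from pathlib import Path

# Directory names taken from the download instructions; the check is illustrative.
EXTERNAL_DIRS = ("flair", "natas", "onmt", "trdg")


def missing_dependencies(workdir="."):
    """Return the names of the external project directories not found in workdir."""
    root = Path(workdir)
    return [name for name in EXTERNAL_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All external projects are in place.")
```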
Please follow the instructions on these websites to get the original data:
- CoNLL 2003:
- UD English EWT:
- 1 Billion Word Language Model Benchmark:
The python script can be used to reproduce our experiments. In this section, we present the command-line arguments and their usage.
In addition to the original configuration, we introduce, modify, or extend the following parameters (emphasized in bold in the table below):
Parameter | Description | Value |
--mode | Execution mode | One of: train, train_lm, tune, eval, sent_gen, sent_gen_txt, noisy_crp, onmt, ds_restore, ds_check. |
--corpus | Data set to use | One of: conll03_en (default), conll03_de, germeval, ud_en, conll03_en_tess3_01, conll03_en_tess4_01, conll03_en_tess4_02, conll03_en_tess4_03, conll03_en_typos, ud_en_tess3_01, ud_en_tess4_01, ud_en_tess4_02, ud_en_tess4_03, ud_en_typos. |
--text_corpus | Text corpus path | The name of a text corpus (default: empty) |
--train_mode | Training mode | One of: combined (default), not-specified. |
--alpha | Weight of the data augmentation objective | Floating point value (default: 0.0) |
--beta | Weight of the stability training objective | Floating point value (default: 0.0) |
--type | Type of embeddings | One of: flair+glove (default), flair+wiki, flair, bert, elmo, glove+char, wiki+char, myflair+glove, myflair. |
--typos_file | File containing look-up table with typos. | e.g.: en.natural, moe_misspellings_train.tsv. |
--correction_module | Spell- or OCR post-correction module | One of: not-specified (default), hunspell, natas. |
--errgen_model | Path to the trained sequence-to-sequence error generator | File path (default: empty). |
--errgen_mode | Error generation mode | One of: errgen_tok, errgen_ch, errcorr_tok. |
--errgen_temp | Sampling temperature | Floating point value (default: 1.0). |
--errgen_topk | Top-K sampling candidates to use | Integer value (default: -1). |
--errgen_nbest | N-best beams to use | Integer value (default: 5). |
--errgen_beam_size | Beam size to use | Integer value (default: 10). |
--seek_file | File name to start with when generating paired data | File name (default: empty). |
--seek_line | Line number to seek when generating paired data | Integer value (default: 0). |
--storage_mode | Embedding storage mode | One of: auto (default), gpu, cpu, none. |
--use_amp | Use mixed-precision training | No parameters, turned off by default. |
-h | Print help | No parameters. |
--lm_type | Type of the language model to use | One of: forward, backward |
--num_layers | The number of network layers | Integer value (default: 1) |
--patience | The number of epochs with no improvement until annealing the learning rate | Used only in the case of LM-training. Integer value (default: 50) |
--anneal_factor | The factor by which the learning rate is annealed | Used only in the case of LM-training. Integer value (default: 0.25) |
--sequence_length | Truncated BPTT window length | Used only in the case of LM-training. Integer value (default: 250) |
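For orientation, a subset of the options from the table can be expressed with `argparse` as in the sketch below. This is not the parser used by the repository. The option names, choices, and defaults are copied from the table, while `build_parser` and the lack of a default for `--mode` are assumptions of this example.

```python
import argparse


def build_parser():
    """Build a parser for an illustrative subset of the options in the table."""
    parser = argparse.ArgumentParser(description="Illustrative subset of the options")
    # The table does not state a default for --mode, so none is set here.
    parser.add_argument("--mode", choices=["train", "train_lm", "tune", "eval",
                                           "sent_gen", "sent_gen_txt", "noisy_crp",
                                           "onmt", "ds_restore", "ds_check"])
    parser.add_argument("--corpus", default="conll03_en")
    parser.add_argument("--alpha", type=float, default=0.0)
    parser.add_argument("--beta", type=float, default=0.0)
    parser.add_argument("--errgen_mode",
                        choices=["errgen_tok", "errgen_ch", "errcorr_tok"])
    parser.add_argument("--errgen_temp", type=float, default=1.0)
    parser.add_argument("--errgen_topk", type=int, default=-1)
    parser.add_argument("--errgen_nbest", type=int, default=5)
    parser.add_argument("--errgen_beam_size", type=int, default=10)
    parser.add_argument("--use_amp", action="store_true")
    return parser


# Parse an argument list resembling a stability training call.
args = build_parser().parse_args(
    ["--mode", "train", "--errgen_mode", "errgen_tok",
     "--errgen_temp", "1.1", "--errgen_topk", "10", "--beta", "1.0"]
)
print(args.corpus, args.errgen_temp, args.errgen_topk, args.beta)
```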
The basic command-line calls can be found here. In addition, we present how to use the functionality related to our approach.
Assuming that the original <text_corpus> is stored in resources/corpora/<text_corpus>, the following call will run the parallel data generation procedure. The results will be stored in results/generated/<text_corpus> afterwards.
python3 --mode sent_gen_txt --text_corpus <text_corpus>
Similarly, the following command will read the sentences from the sequence labeling data set <seq_lab_corpus> in resources/tasks/<seq_lab_corpus>, render them, and store the results in results/generated/<seq_lab_corpus>.
python3 --mode sent_gen --corpus <seq_lab_corpus>
We can move the resultant <seq_lab_corpus> to the resources/tasks directory, so we can use it for evaluation or training.
The parallel data set needs to be normalized prior to using it further. To this end, we apply the normalization script as follows:
scripts/ <text_corpus>
As before, we assume that the <text_corpus> is located under results/generated/<text_corpus>.
To restore the noisy data sets used in our experiments, we execute the following command:
python3 --mode ds_restore --corpus <seq_lab_corpus>
where the <seq_lab_corpus> is either conll03_en or ud_en. Our scripts will then recreate the underlying data based on the sequences of edit operations stored in resources/conversion/<seq_lab_corpus>/<train/dev/test>_ops.txt.
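The idea of recreating noisy text from a clean source plus a sequence of edit operations can be sketched as follows. The operation format used here (tuples such as `("replace", text)`) is hypothetical and chosen only for illustration. It is not the format of the *_ops.txt files, and `apply_edit_ops` is not a function of this repository.

```python
def apply_edit_ops(tokens, ops):
    """Apply a sequence of edit operations to a list of clean tokens.

    Hypothetical operation format (NOT the format of the *_ops.txt files):
      ("keep",)           copy the next source token
      ("replace", text)   consume the next source token and emit `text`
      ("delete",)         consume the next source token and emit nothing
      ("insert", text)    emit `text` without consuming a source token
    """
    result = []
    position = 0
    for op in ops:
        kind = op[0]
        if kind == "insert":
            result.append(op[1])
            continue
        if position >= len(tokens):
            raise ValueError("edit operations consume more tokens than available")
        if kind == "keep":
            result.append(tokens[position])
        elif kind == "replace":
            result.append(op[1])
        elif kind != "delete":
            raise ValueError(f"unknown operation: {kind}")
        position += 1
    if position != len(tokens):
        raise ValueError("edit operations do not cover all source tokens")
    return result


clean = ["The", "quick", "brown", "fox"]
ops = [("keep",), ("replace", "qu1ck"), ("delete",), ("keep",), ("insert", ".")]
noisy = apply_edit_ops(clean, ops)
print(noisy)
```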
Moreover, we can validate the generated data using the checksums distributed with our library by running:
python3 --mode ds_check --corpus <seq_lab_corpus>
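Conceptually, such a validation compares a digest of each restored file with a stored reference value. The sketch below shows the general technique with SHA-256. The hash algorithm and the helper names `file_sha256` and `matches_checksum` are assumptions of this example, since the checksum format used by the library is not described here.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Return True if the file digest equals the expected hex string."""
    return file_sha256(path) == expected_hex.strip().lower()
```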
The resultant train/test/dev splits with the _restored suffix can be copied to the resources/tasks folder to be employed, e.g., for evaluation or training. For example, we can use the following command to copy the restored conll03_en_tess4_01 data set:
rsync -a resources/conversion/conll03_en_tess4_01/*_restored* resources/tasks/conll03_en_tess4_01/
The normalized parallel data can be utilized to train a sequence-to-sequence error generation or correction model. The following command will start the procedure that splits the parallel data, converts it to the format used by ONMT and starts the training of the model:
python3 --mode onmt --text_corpus <text_corpus>
The results will be stored in the results/generated/<text_corpus>/<model_name> directory, where <model_name> is constructed from the mode, which is one of errgen_tok, errgen_ch, or errcorr_tok, and the size of the parallel data set (100, 1k, 10k, 100k, 1M, 10M), e.g., model_errgen_tok_100k.
By default, the token-level error model will be trained. To change this behavior, e.g., to train the error correction model, we need to modify the code in the onmt() function by uncommenting the line that corresponds to the option that we need, e.g.:
# mode = Seq2SeqMode.ErrorGenerationCh
# mode = Seq2SeqMode.ErrorGenerationTok
mode = Seq2SeqMode.ErrorCorrectionTok
Having trained the error generation model, we can use it to train a downstream sequence labeling model <model_name> using the NAT technique. The following command will start the stability training of <model_name> on the English CoNLL 2003 training data using the error generator model stored in results/generated/<text_corpus>/model_errgen_tok_100k/<text_corpus>, the token-to-token generation mode, a sampling temperature of 1.1, and the top-k = 10 best candidates.
python3 --mode train --model <model_name> --corpus conll03_en --type flair+glove --errgen_model results/generated/<text_corpus>/model_errgen_tok_100k/<text_corpus> --errgen_mode errgen_tok --errgen_temp 1.1 --errgen_topk 10 --beta 1.0
The models will be stored in the resources/taggers directory.
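The effect of the sampling temperature and the top-k setting used in the call above can be illustrated with a generic sketch. It shows the standard technique only and is not the code used by ONMT or by this repository. The function `sample_top_k` and the dictionary of candidate log-scores are hypothetical.

```python
import math
import random


def sample_top_k(scores, temperature=1.0, top_k=-1, rng=random):
    """Sample one candidate from unnormalized log-scores.

    scores: dict mapping candidate -> log-score (hypothetical input format)
    temperature: values > 1 flatten the distribution, values < 1 sharpen it
    top_k: keep only the k best candidates; -1 keeps all of them
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    scaled = [score / temperature for _, score in ranked]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    candidates = [candidate for candidate, _ in ranked]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Made-up scores for noisy variants of the token "hello".
scores = {"hello": -0.1, "hel1o": -2.5, "he11o": -4.0, "bello": -3.0}
print(sample_top_k(scores, temperature=1.1, top_k=10, rng=random.Random(0)))
```

A higher temperature makes unlikely variants more probable, while a smaller top-k restricts sampling to the most plausible ones.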
To extract the data for NLM training, we can use the following command:
python3 --mode noisy_crp --text_corpus <text_corpus>
where the <text_corpus> refers to the data stored in results/generated/<text_corpus>. As a result, it will create two sub-directories: <text_corpus>_pairs_norm_org_<max_lines> and <text_corpus>_pairs_norm_rec_<max_lines>, where the former and the latter will contain the clean and the noisy part of the parallel text corpus, respectively. The <max_lines> parameter refers to the maximum number of lines to be extracted from the source text file and is unbounded by default, but it can be adjusted in the code if necessary.
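The split into a clean and a noisy part can be pictured with the following sketch. It operates on in-memory sentence pairs, because the on-disk format of the parallel data is not described here. The function `split_parallel` and the pair layout (clean first, noisy second) are assumptions of this example.

```python
from itertools import islice


def split_parallel(pairs, max_lines=None):
    """Split (clean, noisy) sentence pairs into two aligned lists.

    max_lines limits the number of extracted pairs; None means unbounded.
    """
    clean, noisy = [], []
    for original, recognized in islice(pairs, max_lines):
        clean.append(original)
        noisy.append(recognized)
    return clean, noisy


pairs = [("a clean line", "a c1ean line"), ("second line", "sec0nd 1ine")]
clean_part, noisy_part = split_parallel(pairs, max_lines=1)
print(clean_part, noisy_part)
```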
To utilize the generated corpora for NLM training, you need to copy them to the resources/corpora directory.
A previously extracted noisy corpus can be used as the source of text for NLM training. Except for the source of textual input, the NLM training follows the standard routines of the FLAIR library. Please refer to the instructions on how to prepare the data for the language model training. Subsequently, the NLM training can be performed using the NAT framework with the following call:
python3 --mode train_lm --text_corpus <lm_text_corpus> --lm_type <lm_type> --model custom_<lm_type>
where <lm_text_corpus> is the text corpus prepared for LM training located in the resources/corpora directory and <lm_type> refers to the type of a LM to be trained (either forward or backward). The results of this call will be stored in the resources/language_models/custom_<lm_type> directory.
Our pre-trained NLM embeddings can be found here: custom_forward and custom_backward. Both directories (custom_forward and custom_backward) need to be placed in the resources/language_models/ directory.
Previously trained NLM embeddings can be used to train a NAT model as follows:
python3 --mode train --corpus <data_set> --model <model_name> --type <embeddings_type>
where <embeddings_type> refers to the type of embeddings to be employed. The myflair and myflair+glove values are built-in aliases for the custom flair embedding models. Please refer to the init_embeddings() function for further details.
Finally, we can utilize a specific error correction module for evaluation in the following way:
python3 --mode eval --model model_name --corpus conll03_en_tess4_01 --col_idx 2 --text_idx 1 --correction_module hunspell
Additional remarks: conll03_en_tess4_01 is a noisy data set generated using our approach and derived from the original English CoNLL 2003 benchmark. To utilize it we need to specify two additional parameters: --col_idx and --text_idx that represent the column index of the class labels and the text column, respectively. The first column in the generated noisy data sets always corresponds to the possibly erroneous text and the second column contains the error-free tokens. In the example above, we will use the noisy tokens.
To use the NATAS module, please download the spacy model using the following command:
python -m spacy download en_core_web_md
You need to manually set the path to your trained sequence-to-sequence correction model as the default value for the parameter model_path in the function correct_text_with_natas(), e.g.:
def correct_text_with_natas(input, ext_dictionary=None, model_path="results/generated/1bilion/model_errcorr_tok_1M/", verbose=False):
Subsequently you can use your correction model as follows:
python3 --mode eval --model model_name --corpus conll03_en_tess4_01 --col_idx 2 --text_idx 1 --correction_module natas
Please cite our paper when using the code:
title = "Empirical Error Modeling Improves Robustness of Noisy Neural Sequence Labeling",
author = {Namysl, Marcin and Behnke, Sven and K{\"o}hler, Joachim},
booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "",
pages = "314--329",
- Marcin Namysl dblp, orcid, Google Scholar, Semantic Scholar
This project is licensed under the MIT License - see the LICENSE file for details