- Contains also the EU documents
- Each document has 66 sentences of
~2000
words - The
*.chnk
files contain each data point on a line with sentences split by '#%' - The original indices before shuffling are stored in
*.orig_idx
- Shuffled documents are found in
*.shf
Each test set contains 40 documents from each class, and the training sets have a maximum of 197 documents (the minimum nr given Scotland, the smallest class)
Install requirements. Spacy is only needed for to obtain PoS tags.
pip3 install requirements.txt
Download spacy models for English and French.
# you may choose to download other models if you wish so
python3 -m spacy download en_core_web_trf
python3 -m spacy download fr_dep_news_trf # for PoS tagging
python3 -m spacy download fr_core_news_lg # for NER
This is not needed, as the splits are already released in this repository. If one wishes to split the files into chunks and to do the train-test split, run:
python3 split_extract.py
We used an En-Fr readily-available MT model implemented in fairseq-py. The fairseq model is available for download and may be loaded into fairseq by following the tutorial from the official repository. The MT model is only important to show that dialect-markers from the source language are preserved in the MT output.
The annotated train-test splits used in the experiments are available here. This script will generate PoS-tagged copies of the text files and copies that have all the annotated entities replaced. It is intended to be used with the English, French, and MT French train-test splits. Check the source code, if you wish to run it on a custom directory.
cd src && python3 make_pos_dirs.py
This will run all the configurations in the *_classifications.py
scripts. Currently it does classifications using function words, pronouns, PoS n-grams, word n-grams.
cd src && \
python3 en_classifications.py && \
python3 fr_classifications.py
The classification scores are obtained by doing multiple experiments on downsampled data. Each class is split into batches of equal size and for each split, we train-test a classifier.
- corpus/* : directories of texts produced by native, non-native MEP's
- corpus/train_test_split: the train test splits for comparing classification similarities across languages
- analogies: csv files of misaligned words across corpora
- feature_selection: directory, we saved the accuracy, F1 score and cofusion matrices for different classification scenarios, filenames: FEATURE_LANGUAGE_PAIRS.ftrs
- features: lists of word and PoS features
- src: source file directory
Maximum nr of sents / doc 66 Scotland chunks: 237 England chunks: 911 Ireland chunks: 476 EU chunks: 449
setup | feature | En vs. Ir | En vs. Sc | Ir vs. Sc | 3-way |
---|---|---|---|---|---|
English Originals | function_words_en | 0.9 | 0.91 | 0.85 | 0.8 |
English Originals | pronouns_en | 0.63 | 0.76 | 0.69 | 0.57 |
English Originals | PoS n-grams | 0.91 | 0.87 | 0.91 | 0.83 |
English Originals | selected_pos_ngrams_en | 0.88 | 0.85 | 0.86 | 0.78 |
English Originals | selected_pos_ngrams_fr | 0.82 | 0.71 | 0.77 | 0.64 |
English no Entities | Word n-grams | 0.91 | 0.89 | 0.92 | 0.83 |
setup | feature | En vs. Ir | En vs. Sc | Ir vs. Sc | 3-way |
---|---|---|---|---|---|
French Translations | function_words_fr | 0.84 | 0.87 | 0.78 | 0.71 |
French Translations | pronouns_fr | 0.82 | 0.8 | 0.72 | 0.66 |
French Translations | PoS n-grams | 0.89 | 0.82 | 0.76 | 0.74 |
French Translations | selected_pos_ngrams_fr | 0.78 | 0.76 | 0.62 | 0.59 |
French Translations | selected_pos_ngrams_en | 0.8 | 0.76 | 0.71 | 0.59 |
French no Entities | Word n-grams | 0.97 | 0.91 | 0.95 | 0.9 |
French MT | function_words_fr | 0.88 | 0.84 | 0.81 | 0.72 |
French MT | pronouns_fr | 0.85 | 0.85 | 0.74 | 0.71 |
French MT | PoS n-grams | 0.94 | 0.87 | 0.84 | 0.78 |
French MT | selected_pos_ngrams_fr | 0.83 | 0.73 | 0.77 | 0.66 |
French MT | selected_pos_ngrams_en | 0.78 | 0.79 | 0.72 | 0.62 |
French MT no Entities | Word n-grams | 0.99 | 0.91 | 0.95 | 0.9 |