Skip to content

Latest commit

 

History

History
111 lines (79 loc) · 5.15 KB

README.md

File metadata and controls

111 lines (79 loc) · 5.15 KB

Dialectal Varieties

Corpus:

  • Contains also the EU documents
  • Each document has 66 sentences of ~2000 words
  • The *.chnk files contain each data point on a line with sentences split by '#%'
  • The original indices before shuffling are stored in *.orig_idx
  • Shuffled documents are found in *.shf

Each test set contains 40 documents from each class, and the training sets have a maximum of 197 documents (the minimum nr given Scotland, the smallest class)

Running the code

1. Install requirements

Install requirements. Spacy is only needed for to obtain PoS tags.

pip3 install requirements.txt

Download spacy models for English and French.

# you may choose to download other models if you wish so
python3 -m spacy download en_core_web_trf
python3 -m spacy download fr_dep_news_trf # for PoS tagging
python3 -m spacy download fr_core_news_lg # for NER

Optional - split raw files into chunks

This is not needed, as the splits are already released in this repository. If one wishes to split the files into chunks and to do the train-test split, run:

python3 split_extract.py

Optional - machine translate English documents to French

We used an En-Fr readily-available MT model implemented in fairseq-py. The fairseq model is available for download and may be loaded into fairseq by following the tutorial from the official repository. The MT model is only important to show that dialect-markers from the source language are preserved in the MT output.

2. Generate PoS-tagged and no-entity files

The annotated train-test splits used in the experiments are available here. This script will generate PoS-tagged copies of the text files and copies that have all the annotated entities replaced. It is intended to be used with the English, French, and MT French train-test splits. Check the source code, if you wish to run it on a custom directory.

cd src && python3 make_pos_dirs.py

3. Generate some tables from text classification experiments

This will run all the configurations in the *_classifications.py scripts. Currently it does classifications using function words, pronouns, PoS n-grams, word n-grams.

cd src && \
python3 en_classifications.py && \
python3 fr_classifications.py

The classification scores are obtained by doing multiple experiments on downsampled data. Each class is split into batches of equal size and for each split, we train-test a classifier.

Directories:

  • corpus/* : directories of texts produced by native, non-native MEP's
  • corpus/train_test_split: the train test splits for comparing classification similarities across languages
  • analogies: csv files of misaligned words across corpora
  • feature_selection: directory, we saved the accuracy, F1 score and cofusion matrices for different classification scenarios, filenames: FEATURE_LANGUAGE_PAIRS.ftrs
  • features: lists of word and PoS features
  • src: source file directory

Maximum nr of sents / doc 66 Scotland chunks: 237 England chunks: 911 Ireland chunks: 476 EU chunks: 449

Results

Original English Data

setup feature En vs. Ir En vs. Sc Ir vs. Sc 3-way
English Originals function_words_en 0.9 0.91 0.85 0.8
English Originals pronouns_en 0.63 0.76 0.69 0.57
English Originals PoS n-grams 0.91 0.87 0.91 0.83
English Originals selected_pos_ngrams_en 0.88 0.85 0.86 0.78
English Originals selected_pos_ngrams_fr 0.82 0.71 0.77 0.64
English no Entities Word n-grams 0.91 0.89 0.92 0.83

French Human and Machine Translated Data

setup feature En vs. Ir En vs. Sc Ir vs. Sc 3-way
French Translations function_words_fr 0.84 0.87 0.78 0.71
French Translations pronouns_fr 0.82 0.8 0.72 0.66
French Translations PoS n-grams 0.89 0.82 0.76 0.74
French Translations selected_pos_ngrams_fr 0.78 0.76 0.62 0.59
French Translations selected_pos_ngrams_en 0.8 0.76 0.71 0.59
French no Entities Word n-grams 0.97 0.91 0.95 0.9
French MT function_words_fr 0.88 0.84 0.81 0.72
French MT pronouns_fr 0.85 0.85 0.74 0.71
French MT PoS n-grams 0.94 0.87 0.84 0.78
French MT selected_pos_ngrams_fr 0.83 0.73 0.77 0.66
French MT selected_pos_ngrams_en 0.78 0.79 0.72 0.62
French MT no Entities Word n-grams 0.99 0.91 0.95 0.9