This is the official repo for the following paper:
- The Curse of Performance Instability in Analysis Datasets: Consequences, Source, and Suggestions, Xiang Zhou, Yixin Nie, Hao Tan and Mohit Bansal, EMNLP 2020 (arxiv)
This code requires Python 3. All the dependencies are specified in "requirements.txt":
pip install -r requirements.txt
The current code supports the calculation of decomposed variance metrics from standard evaluation numbers.
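As a rough illustration of what a decomposed variance can look like, the sketch below splits the variance of accuracy across seeds into a per-example (independent) part and a between-example covariance remainder. This is a generic statistical identity written under assumptions, not necessarily what variance_report.py computes: the function name `decompose_variance` and the 0/1 correctness matrix `correct[s][i]` are hypothetical and chosen only for this example.

```python
from statistics import pvariance


def decompose_variance(correct):
    """Split the variance of accuracy across seeds into two parts.

    correct[s][i] is 1 if the run with seed s predicted example i
    correctly, else 0 (a hypothetical input format for this sketch).
    Returns (total_var, independent_var, covariance).
    """
    n_seeds = len(correct)
    n_examples = len(correct[0])

    # Accuracy of each seed's run, and its population variance across seeds.
    accs = [sum(row) / n_examples for row in correct]
    total_var = pvariance(accs)

    # Independent part: variance of each example's correctness across seeds,
    # summed and scaled by 1 / N^2.
    per_example = [
        pvariance([correct[s][i] for s in range(n_seeds)])
        for i in range(n_examples)
    ]
    independent_var = sum(per_example) / n_examples ** 2

    # Remainder: the summed covariance between different examples.
    covariance = total_var - independent_var
    return total_var, independent_var, covariance


# Three seeds, four examples (made-up values for illustration only).
runs = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 1, 0]]
total_var, independent_var, covariance = decompose_variance(runs)
print(total_var, independent_var, covariance)
```

Population variance is used throughout so that the two parts add up exactly to the total.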
- Download the NLI datasets and put them under the nli_data folder in the root directory.
- Organize the evaluation results of your model under the models directory in the same way as the berts folder (an example folder showing the results of BERT-base). The name of each folder represents the model type, and MODEL_TYPE/seed_x saves the evaluation results with seed x.
  - Inside MODEL_TYPE/seed_x/, each folder represents the evaluation results on one dataset and includes three files:
    - eval_results.txt: Final accuracy of the model
    - logits_results.txt: List of logits output by the model on every example in the dataset
    - pred_results.txt: List of labels predicted by the model on every example in the dataset
- Run the evaluation script with
  python variance_report.py MODEL_PATH
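Before running the report, it can help to check that the results folder follows the layout described above. The following is a minimal sketch, not part of this repo: the helper name `find_missing_files` is hypothetical, and it assumes that MODEL_PATH points at a MODEL_TYPE folder whose subfolders are named seed_x, each containing one subfolder per dataset.

```python
from pathlib import Path

# The three files expected in every dataset folder, as listed above.
EXPECTED_FILES = ("eval_results.txt", "logits_results.txt", "pred_results.txt")


def find_missing_files(model_path):
    """Return paths of expected result files that are absent.

    Assumes model_path is a MODEL_TYPE folder containing seed_* folders,
    each of which holds one subfolder per dataset.
    """
    missing = []
    for seed_dir in sorted(Path(model_path).glob("seed_*")):
        if not seed_dir.is_dir():
            continue
        # Every subfolder of a seed folder is treated as one dataset.
        for dataset_dir in sorted(p for p in seed_dir.iterdir() if p.is_dir()):
            for name in EXPECTED_FILES:
                if not (dataset_dir / name).is_file():
                    missing.append(dataset_dir / name)
    return missing
```

For example, `find_missing_files("models/berts")` would return an empty list if every dataset folder under every seed is complete, assuming the example folder sits at that path.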
Other scripts (training/evaluation/analysis) and the model checkpoints used in the paper will be released soon.