Implementations of the unbiased estimator and experiments from "Showing Your Work Doesn't Always Work" (ACL 2020).
@misc{tang2020showing,
title={Showing Your Work Doesn't Always Work},
author={Raphael Tang and Jaejun Lee and Ji Xin and Xinyu Liu and Yaoliang Yu and Jimmy Lin},
year={2020},
eprint={2004.13705},
archivePrefix={arXiv}
}
- Clone the repository:
  git clone https://github.com/castorini/meanmax && cd meanmax
- With Python 3.7, install the requirements (use virtualenv if you want):
  pip install -r requirements.txt

That's all!
For a quick demonstration, you can use a reduced number of iterations at the cost of some precision. However, it'll be sufficient to see the effects of bias and ill-constructed confidence intervals.
To draw biased MeanMax curves, run
python -m meanmax.run.simulate --action mme -k 15 -n 15 -n 2000 --mc-total 10000
To draw unbiased MeanMax curves, run
python -m meanmax.run.simulate --action mme -k 15 -n 15 -n 2000 --mc-total 10000 --unbiased
The unbiased curve should be closer to the true curve.
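To make the difference between the two curves concrete, here is a minimal sketch of both estimators of the expected maximum of n runs, given N observed scores. It assumes the biased estimator is the plug-in one built on the empirical CDF (the maximum of n draws with replacement) and the unbiased one averages the maximum over all size-n subsets drawn without replacement. The function names are hypothetical and are not the repository's API.

```python
from math import factorial


def _choose(a, b):
    """Binomial coefficient C(a, b); zero when b is outside 0..a."""
    if b < 0 or b > a:
        return 0
    return factorial(a) // (factorial(b) * factorial(a - b))


def biased_meanmax(scores, n):
    """Plug-in estimate: expected max of n draws WITH replacement."""
    xs = sorted(scores)
    total = len(xs)
    # P(max is the i-th smallest score) = (i/N)^n - ((i-1)/N)^n
    return sum(
        x * ((i / total) ** n - ((i - 1) / total) ** n)
        for i, x in enumerate(xs, start=1)
    )


def unbiased_meanmax(scores, n):
    """Expected max of n draws WITHOUT replacement (average over subsets)."""
    xs = sorted(scores)
    total = len(xs)
    if not 1 <= n <= total:
        raise ValueError("n must satisfy 1 <= n <= len(scores)")
    # P(max is the i-th smallest score) = C(i-1, n-1) / C(N, n)
    weighted = sum(x * _choose(i - 1, n - 1) for i, x in enumerate(xs, start=1))
    return weighted / _choose(total, n)
```

For the scores [1, 2, 3] and n = 2, the subset-based sketch gives 8/3 (about 2.667) while the plug-in sketch gives 22/9 (about 2.444). The plug-in value is lower because resampling with replacement can pick the same score twice.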
To see the proportion of negative errors for the biased estimator, run
python -m meanmax.run.simulate --action "mme-test" -s 30 --start-no 25 -n 5000
For the unbiased estimator, run
python -m meanmax.run.simulate --action "mme-test" -s 30 --start-no 25 -n 5000 --unbiased
The first proportion should be around 68% and the second around 50%.
To see the empirical coverage using the percentile bootstrap, run
python -m meanmax.run.simulate --action bs -s 30 --start-no 25 -n 100 --mc-total 1000
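For readers unfamiliar with the terms, the following is a generic sketch of a percentile bootstrap interval and of how empirical coverage can be measured. It is not the repository's implementation: the function names, the nearest-rank cut-offs, and the `draw_sample` callback are assumptions made for illustration.

```python
import random


def percentile_bootstrap_ci(sample, statistic, alpha=0.05, n_boot=1000, rng=None):
    """Return a (lower, upper) percentile bootstrap interval for `statistic`."""
    rng = rng or random.Random()
    size = len(sample)
    # Recompute the statistic on n_boot resamples drawn with replacement.
    stats = sorted(
        statistic([rng.choice(sample) for _ in range(size)])
        for _ in range(n_boot)
    )
    # Nearest-rank cut-offs for the alpha/2 and 1 - alpha/2 percentiles.
    lo_idx = max(0, min(n_boot - 1, int(alpha / 2 * n_boot)))
    hi_idx = max(0, min(n_boot - 1, int((1 - alpha / 2) * n_boot) - 1))
    return stats[lo_idx], stats[hi_idx]


def empirical_coverage(draw_sample, statistic, true_value, trials=200, **kwargs):
    """Fraction of trials whose interval contains `true_value`."""
    hits = 0
    for _ in range(trials):
        lo, hi = percentile_bootstrap_ci(draw_sample(), statistic, **kwargs)
        hits += lo <= true_value <= hi
    return hits / trials
```

In the MeanMax setting, `statistic` would be an expected-maximum estimator and `true_value` the expected maximum of the simulated distribution. With `alpha=0.05`, a well-constructed interval should cover the true value in roughly 95% of trials, so a noticeably lower fraction signals a poorly calibrated interval.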
For convenience, first run
alias process_hedwig="(tail -n +2 data/hedwig.tsv | grep reg_lstm | cut -d$'\t' -f5 && echo && tail -n +2 data/hedwig.tsv | grep mlp | cut -d$'\t' -f5)"
Then, for each of the following scripts, append the --unbiased option to use the unbiased estimator, and --swapped to use the MLP instead of the LSTM.
Drawing MeanMax curves: process_hedwig | python -m meanmax.run.simulate --action mme --use-kde
False conclusion probing: process_hedwig | python -m meanmax.run.simulate --action mme-test --use-kde
CI coverage: process_hedwig | python -m meanmax.run.simulate --action bs --use-kde -k <the k to test>
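The process_hedwig alias skips the header of data/hedwig.tsv, keeps the rows that mention reg_lstm or mlp, and prints the fifth tab-separated field of each. The sketch below does roughly the same extraction in Python, for anyone who wants to inspect the values directly. The function name is hypothetical, and the assumption that the fifth column holds the score comes only from the `cut -f5` in the alias.

```python
def read_hedwig_scores(path="data/hedwig.tsv", models=("reg_lstm", "mlp"), column=5):
    """Return {model: [values]} taken from the 1-based `column` of matching rows."""
    scores = {name: [] for name in models}
    with open(path, encoding="utf-8") as handle:
        next(handle, None)  # skip the header row, like `tail -n +2`
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            for name in models:
                # `grep` matches the name anywhere on the line, so do the same.
                if name in line and len(fields) >= column:
                    scores[name].append(fields[column - 1])
    return scores
```

The values are returned as strings, as `cut` would print them; convert them with `float` if the column is numeric in your copy of the file.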