Code and data for "Data Similarity is Not Enough to Explain Language Model Performance" (EMNLP 2023).
The file `requirements.txt` contains the Python packages required to run this code.
This code is written for use with a GPU.
This repository contains the raw similarity and performance results used to construct the figures and tables in the paper:

- `raw-results/bigbench-raw-results-multiple-choice.csv`: overall BIG-bench Lite performances and similarities used to create the figures and tables.
- `raw-results/cosine-similarities`: the cosine similarities between Sentence-T5 embeddings of sampled downstream documents and all of the Pile and C4.
- `raw-results/sampled-pretraining-idxs`: the indices of sampled pretraining documents from the Pile and C4. These correspond to indices in the datasets when loaded with the Hugging Face `datasets` library, and were used for calculating KL divergences and MAUVE scores against samples of the pretraining datasets (see the sketch below).

The `raw-results` directory also contains the performances and similarities for the Stack Exchange and XNLI datasets.
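The snippet below is a minimal sketch of how the saved indices map back to rows of the pretraining corpora via the `datasets` library. The index-file name and its format (one integer per line) are assumptions for illustration, not guarantees about the repository's layout.

```python
# Sketch: recover the sampled C4 documents from saved indices.
# Assumption: index files contain one integer per line; the filename is hypothetical.
from datasets import load_dataset

# Note: this downloads the full English C4 corpus, which is several hundred GB.
c4 = load_dataset("allenai/c4", "en", split="train")

with open("raw-results/sampled-pretraining-idxs/c4-sample-1.txt") as f:  # hypothetical name
    idxs = [int(line) for line in f if line.strip()]

sampled_docs = c4.select(idxs)          # the rows the indices refer to
print(sampled_docs[0]["text"][:200])    # first sampled document (truncated)
```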
To reproduce the figures and tables from the paper, run the following commands:

- `python tables-1-and-3.py`: generates the tables with correlations between similarity and performance for BIG-bench Lite multiple choice tasks.
- `python table-2.py`: generates the table with correlations between similarity measures when comparing BIG-bench Lite with C4 and the Pile.
- `python figure-1.py --reproduction`: generates the graphs of similarity and performance for Stack Exchange and BIG-bench Lite multiple choice tasks.
- `python figure-2.py --reproduction`: generates the graphs of similarity and performance for Stack Exchange and XNLI.
- `python figure-bigbench-violinplots.py --model Pythia-6.9B`: generates the BIG-bench Lite violin plots in figure 3.
- `python figure-4.py`: generates the GLUE violin plots in figure 4.
- `python figure-5.py --reproduction`: generates the small multiples for all XNLI languages.
- `python figure-6.py`: generates the BIG-bench Lite bar plots binned by similarity quartile.
- `python figure-bigbench-violinplots.py --model T5`: generates the BIG-bench Lite violin plots in figure 7.
- `python figure-8.py`: generates the Stack Exchange AM/PM graph of similarity and performance.
Each file contains 1000 posts from Stack Exchange: 500 posts from each of the `bicycles` and `cstheory` Stack Exchange forums. All posts were originally in English. The percentage in the filename refers to how much of each post has been translated by Google Translate. For example, the file `bicycles-cstheory_50-percent-finnish.json` contains posts whose first half is in Finnish. The original dataset is `bicycles-cstheory_100-percent-english.json`.
These datasets were adapted from Comparing Text Representations: A Theory-Driven Approach (EMNLP 2021).
Each file contains 2,500 natural language inference questions. All documents were originally in English. Again, the percentage in the filename refers to how much of each document has been translated by Google Translate. For example, the file `xnli_50-percent-spanish` contains documents whose first half is in Spanish. The original XNLI datasets are those with `100-percent` in the filename.
These datasets were adapted from XNLI: Evaluating cross-lingual sentence representations (EMNLP 2018).
Stack Exchange forum classification: Each example contains the post text and the forum it was posted to (either 'bicycles' or 'cstheory'). We format each example as `{text}\nThis post is about {forum}`. The prompt is the same across languages.
Stack Exchange AM/PM classification: Each example contains the post text and whether it was posted in the 'morning' or 'afternoon'. We format each example as `{text}\nThis was posted in the {label}`. The prompt is the same across languages.
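For concreteness, here is a small sketch of how both Stack Exchange prompt templates above could be rendered in Python. The field names (`text`, `forum`, `label`) are assumptions about the JSON schema, not guaranteed by the repository.

```python
# Sketch of the two Stack Exchange prompt formats described above.
# Assumption: each post is a dict with "text", "forum", and "label" fields.

def format_forum_example(post: dict) -> str:
    # "{text}\nThis post is about {forum}"
    return f"{post['text']}\nThis post is about {post['forum']}"

def format_ampm_example(post: dict) -> str:
    # "{text}\nThis was posted in the {label}", label is "morning" or "afternoon"
    return f"{post['text']}\nThis was posted in the {post['label']}"

example = {"text": "My rear derailleur keeps slipping.", "forum": "bicycles", "label": "morning"}
print(format_forum_example(example))
print(format_ampm_example(example))
```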
XNLI: Since XNLI consists of translations of MNLI, we use an MNLI prompt for all datasets. Examples include a premise, a hypothesis, and a label. Labels are mapped, in order, from 'entailment', 'neutral', 'contradiction' to 'Yes', 'Maybe', 'No'. An example then becomes `{premise} {label}, {hypothesis}`. The prompt is the same across languages.
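A short sketch of the MNLI-style prompt above. The field names (`premise`, `hypothesis`, `label`) follow the usual XNLI/MNLI convention and are an assumption about how the JSON files are organized.

```python
# Sketch of the MNLI-style prompt used for all XNLI translations.
LABEL_WORDS = {"entailment": "Yes", "neutral": "Maybe", "contradiction": "No"}

def format_xnli_example(ex: dict) -> str:
    # "{premise} {label}, {hypothesis}"
    return f"{ex['premise']} {LABEL_WORDS[ex['label']]}, {ex['hypothesis']}"

print(format_xnli_example({
    "premise": "A man is playing a guitar on stage.",
    "hypothesis": "A musician is performing.",
    "label": "entailment",
}))
```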
BIG-bench Lite: BIG-bench Lite examples come pre-formatted when loaded through the Hugging Face `datasets` library. We use the existing formatting.
GLUE: For T5 experiments, we use the prompting style described in the original T5 paper (Raffel et al., JMLR 2020). For example, an MNLI example with a premise and a hypothesis becomes `mnli hypothesis: {hypothesis} premise: {premise} {label}`, where the label can be 'entailment', 'contradiction', or 'neutral'.
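A minimal sketch of the T5-style MNLI prompt above; the example dictionary is illustrative only.

```python
# Sketch of the T5-style GLUE prompt for MNLI described above.
def format_t5_mnli(ex: dict, label: str) -> str:
    # "mnli hypothesis: {hypothesis} premise: {premise} {label}"
    return f"mnli hypothesis: {ex['hypothesis']} premise: {ex['premise']} {label}"

print(format_t5_mnli(
    {"premise": "The cat sat on the mat.", "hypothesis": "An animal is on the mat."},
    "entailment",
))
```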
This code calculates zero-shot accuracy on BIG-bench Lite, GLUE, Stack Exchange, and XNLI. It then calculates each downstream dataset's similarity to two pretraining datasets (the Pile and C4).
`sample_pretraining_data.py`: Downloads C4 and the Pile, samples 100,000 documents at a time, and creates Sentence-T5 (ST5) embeddings. Arguments:

- `dataset_name`: either `the_pile` or `c4`
- `sample_num`: an integer identifier for this sample; we use integers from 1 to 8
- `num_samples`: the number of documents to sample
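A rough sketch of the sampling-and-embedding step this script performs. The `sentence-transformers` library, the `sentence-t5-base` checkpoint, and uniform random sampling are assumptions for illustration; the repository's script may differ.

```python
# Sketch: sample documents from C4 and embed them with a Sentence-T5 model.
import random
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

num_samples = 100_000
c4 = load_dataset("allenai/c4", "en", split="train")
idxs = random.sample(range(len(c4)), num_samples)
docs = c4.select(idxs)["text"]

model = SentenceTransformer("sentence-transformers/sentence-t5-base", device="cuda")
embeddings = model.encode(docs, batch_size=64, show_progress_bar=True)  # (num_samples, dim)
```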
`n-gram-distributions-pretraining.py`: Generates token distributions for samples from C4 and the Pile. Arguments:

- `dataset_name`: either `the_pile` or `c4`
- `sample_num`: an integer identifier for this sample; we use integers from 1 to 8
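For illustration, a minimal sketch of building a unigram token distribution of the kind these scripts produce and that is later compared with KL divergence. The choice of the Pythia tokenizer here is an assumption.

```python
# Sketch: unigram token distribution over a list of documents.
from collections import Counter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-6.9b")

def token_distribution(docs):
    counts = Counter()
    for doc in docs:
        counts.update(tokenizer.encode(doc))
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

dist = token_distribution(["An example document.", "Another example."])
```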
`eval-stackexchange.py`: Runs Pythia-6.9B zero-shot eval on the Stack Exchange classification tasks. Arguments:

- `label_type`: either `forum` or `ampm`
- `model`: name of a `transformers` model. We report results with `Pythia-6.9B`.
`eval-xnli.py`: Runs Pythia-6.9B zero-shot eval on the XNLI classification tasks. Arguments:

- `model`: name of a `transformers` model. We report results with `Pythia-6.9B`.
`eval-bigbench.py`: Runs Pythia-6.9B zero-shot and few-shot eval on BIG-bench Lite tasks. Arguments:

- `model`: name of a `transformers` model. We report results with `Pythia-6.9B`.
`eval-bigbench-t5.py`: Runs T5-3B and T5 v1.1 XL zero-shot and few-shot eval on BIG-bench Lite tasks. Arguments:

- `model`: options are `t5` for finetuned T5-3B and `t5-v1_1-xl` for T5 v1.1 XL.
`eval-glue.py`: Runs zero-shot eval on GLUE tasks. Arguments:

- `model`: name of a `transformers` model.
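The eval scripts above perform zero-shot classification with `transformers` models. One common way to do this, sketched below, is to score each candidate label continuation under the language model and pick the most likely. This is an illustration of the general approach only; the repository's scripts may differ in details such as length normalization or label verbalizers.

```python
# Sketch: zero-shot classification by comparing label log-likelihoods with a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-6.9b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda().eval()

@torch.no_grad()
def label_logprob(prompt: str, label: str) -> float:
    # Log-probability of the label tokens given the prompt.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    label_ids = tokenizer(label, return_tensors="pt").input_ids
    full_ids = torch.cat([prompt_ids, label_ids], dim=1).cuda()
    logits = model(full_ids).logits.log_softmax(dim=-1)
    # Logits at position i-1 predict the token at position i; sum over label positions.
    label_positions = range(prompt_ids.shape[1], full_ids.shape[1])
    return sum(logits[0, i - 1, full_ids[0, i]].item() for i in label_positions)

def predict(prompt: str, candidates: list[str]) -> str:
    return max(candidates, key=lambda c: label_logprob(prompt, c))

# e.g. predict("...post text...\nThis post is about ", ["bicycles", "cstheory"])
```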
Run the following files to create token distributions:

- `n-gram-distributions-bigbench.py`
- `n-gram-distributions-stackexchange.py`
- `n-gram-distributions-xnli.py`
Run the following files to construct embeddings:

- `embeddings-bigbench.py`
- `embeddings-glue.py`
Run the following files:

- `process-stackexchange.py`
- `process-xnli.py`
- `process-bigbench.py`
- `process-glue.py`
We are unable to include our code for calculating cosine similarities between
entire pretraining datasets and examples from downstream datasets, but the
scripts in this section calculate such similarities against a sample of the
pretraining dataset. We include our raw results against the entire pretraining
dataset in the `cosine-similarities` directory.
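As an illustration of that similarity computation (not the repository's exact code), the sketch below computes max and mean cosine similarities between downstream-example embeddings and a sample of pretraining embeddings; the `.npy` file names are hypothetical.

```python
# Sketch: cosine similarities between downstream and sampled pretraining embeddings.
import numpy as np

downstream = np.load("embeddings/bigbench-example-embeddings.npy")   # (n_downstream, dim), hypothetical path
pretraining = np.load("embeddings/c4-sample-embeddings.npy")         # (n_pretrain, dim), hypothetical path

# Normalize rows so that dot products are cosine similarities.
downstream /= np.linalg.norm(downstream, axis=1, keepdims=True)
pretraining /= np.linalg.norm(pretraining, axis=1, keepdims=True)

sims = downstream @ pretraining.T    # (n_downstream, n_pretrain)
max_sim = sims.max(axis=1)           # max similarity per downstream example
mean_sim = sims.mean(axis=1)         # mean similarity per downstream example
```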
Run the following files to generate the tables and figures in the paper:

- `tables-1-and-3.py`: Generates tables 1 and 3.
- `table-2.py`: Generates table 2.
- `figure-1.py`: Generates figure 1.
- `figure-2.py`: Generates figure 2.
- `figure-bigbench-violinplots.py`: Generates figures 3 and 7. Arguments are `model` (either `Pythia-6.9B` or `T5`) and `similarity_type` (either `max` or `mean`).
- `figure-4.py`: Generates figure 4.
- `figure-5.py`: Generates figure 5.
- `figure-6.py`: Generates figure 6. Arguments are `model` (either `Pythia-6.9B` or `T5`).
- `process-stackexchange-ampm.py`: Generates figure 8.
This table contains the raw performance results for BIG-bench Lite multiple choice tasks that are used to construct tables and figures but that are not explicitly included in the main paper.
Dataset | Pythia-6.9B (0 shot) | T5 v1.1 XL (0 shot) | T5 v1.1 XL (2 shot) | Flan-T5 XL (0 shot) | Flan-T5 XL (2 shot) |
---|---|---|---|---|---|
bbq_lite_json | -22.02 | 4.80 | 8.75 | 19.90 | 37.76 |
code_line_description | -2.00 | -10.86 | -10.86 | 17.96 | 24.61 |
conceptual_combinations | 0.00 | 0.32 | 5.50 | 52.10 | 41.75 |
emoji_movie | 6.25 | -1.25 | -5.00 | 3.75 | 3.75 |
formal_fallacies_syllogisms_negation | -0.07 | 0.00 | 0.00 | 1.18 | 2.89 |
hindu_knowledge | 1.96 | -0.51 | 1.02 | 14.72 | 16.24 |
known_unknowns | 4.35 | 0.00 | 4.35 | -8.70 | -17.39 |
language_identification | 5.49 | 0.25 | -0.08 | 12.34 | 4.73 |
logic_grid_puzzle | 18.88 | -2.60 | -2.60 | 8.43 | 0.01 |
logical_deduction | 0.00 | -1.04 | 0.16 | 15.91 | 11.09 |
novel_concepts | -8.53 | -21.09 | -17.19 | -9.38 | -1.56 |
operators | 0.00 | 0.95 | 0.00 | 5.71 | 6.19 |
play_dialog_same_or_different | 21.94 | -26.16 | -26.16 | 18.75 | 15.63 |
strange_stories | -3.02 | -9.56 | -5.79 | 46.16 | 34.83 |
strategyqa | 3.89 | -6.42 | -6.51 | 7.56 | 28.35 |
symbol_interpretation | -1.01 | -0.51 | -2.40 | 1.01 | 1.01 |
vitaminc_fact_verification | 24.57 | 5.34 | 5.34 | 56.71 | 41.31 |