Paper Link: arxiv.org/abs/2402.18819
In-context learning (ICL) exhibits dual operating modes: task learning, i.e., acquiring a new skill from in-context samples, and task retrieval, i.e., locating and activating a relevant pretrained skill. Recent theoretical work investigates various mathematical models to analyze ICL, but existing models explain only one operating mode at a time. We introduce a probabilistic model, with which one can explain the dual operating modes of ICL simultaneously. Focusing on in-context learning of linear functions, we extend existing models for pretraining data by introducing multiple task groups and task-dependent input distributions. We then analyze the behavior of the optimally pretrained model under the squared loss, i.e., the MMSE estimator of the label given in-context examples. Regarding pretraining task distribution as prior and in-context examples as observation, we derive the closed-form expression of the task posterior distribution. With the closed-form expression, we obtain a quantitative understanding of the two operating modes of ICL. Furthermore, we shed light on an unexplained phenomenon observed in practice: under certain settings, the ICL risk initially increases and then decreases with more in-context examples. Our model offers a plausible explanation for this "early ascent" phenomenon: a limited number of in-context samples may lead to the retrieval of an incorrect skill, thereby increasing the risk, which will eventually diminish as task learning takes effect with more in-context samples. We also theoretically analyze ICL with incorrect labels, e.g., zero-shot ICL, where in-context examples are assigned random labels. Lastly, we validate our findings and predictions via experiments involving Transformers and large language models.
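As a rough intuition for the two modes described above, the following self-contained sketch (our simplified illustration, not code from the repository and not the paper's exact model; it uses a plain Gaussian-mixture prior over linear-regression tasks and ignores task-dependent input distributions) computes the closed-form posterior over task groups and the resulting MMSE estimate of the task vector as the number of in-context examples k grows. With small k the posterior concentrates on the nearest pretraining group (task retrieval); with large k the estimate converges to the true task vector even though it lies outside the pretraining groups (task learning).
# Toy illustration (not the paper's exact model): Gaussian-mixture prior over
# linear-regression tasks, closed-form posterior over task groups given
# in-context examples, and the resulting MMSE estimate of the task vector.
import numpy as np
from scipy.stats import multivariate_normal
rng = np.random.default_rng(0)
d, tau2, sigma2 = 2, 0.05, 0.1                 # input dim, within-group task variance, label noise
centers = np.array([[1.0, 0.0], [0.0, 1.0]])   # means of two pretraining task groups
pi = np.array([0.5, 0.5])                      # prior weights of the groups
w_true = np.array([0.0, 1.2])                  # test task: close to group 2 but not equal to its mean
for k in [1, 2, 4, 16, 64, 256]:
    X = rng.normal(size=(k, d))
    y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=k)
    cov = tau2 * X @ X.T + sigma2 * np.eye(k)  # marginal covariance of y within any group
    # Posterior over task groups: proportional to pi_j * N(y; X mu_j, cov)
    log_post = np.array([np.log(pi[j]) + multivariate_normal.logpdf(y, mean=X @ centers[j], cov=cov)
                         for j in range(len(pi))])
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    # MMSE estimate of the task vector: mixture of per-group Gaussian posterior means
    w_hat = sum(post[j] * (centers[j] + tau2 * X.T @ np.linalg.solve(cov, y - X @ centers[j]))
                for j in range(len(pi)))
    print(f"k={k:4d}  group posterior={np.round(post, 3)}  ||w_hat - w_true||={np.linalg.norm(w_hat - w_true):.3f}")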
The following sections give guidance for reproducing all the experiments in the paper.
To replicate the experiments efficiently, download the .zip files from the provided Dropbox link and unzip them directly into the corresponding directories within your cloned or downloaded GitHub repository. (A Dropbox account is required; a free account registered with any email is sufficient, and no paid plan is needed to download the data.) For instance, if a .zip file resides in "NumericalComputation/Figure4/" within Dropbox, it should be unzipped to "NumericalComputation/Figure4/" in your local repository. Note that some experimental outputs are not included in this link because of their long execution time.
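If you prefer to script the extraction step, a small helper along the following lines (purely illustrative and not part of the repository; both paths below are placeholders to adjust) unzips each downloaded archive into the repository directory that mirrors its location in Dropbox:
# Illustrative helper (not part of the repository): extract every downloaded
# .zip into the repository directory that mirrors its location in Dropbox.
import zipfile
from pathlib import Path
DROPBOX_ROOT = Path("~/Downloads/DropboxData").expanduser()   # placeholder: local copy of the Dropbox folder
REPO_ROOT = Path("~/your-clone-of-this-repo").expanduser()    # placeholder: path to your clone of this repository
for zip_path in DROPBOX_ROOT.rglob("*.zip"):
    target_dir = REPO_ROOT / zip_path.parent.relative_to(DROPBOX_ROOT)
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target_dir)
    print(f"Extracted {zip_path.name} -> {target_dir}")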
Ubuntu 22.04.3 LTS
Python 3.10.12
setproctitle 1.3.2
matplotlib 3.7.2
tqdm 4.66.1
scikit-learn 1.3.2
scipy 1.11.2
pytorch 2.0.1
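If these packages are not already installed, one possible way to set them up with pip is shown below (the PyTorch wheel you need may differ depending on your CUDA version, so treat this as a starting point rather than the exact command used by the authors):
pip install setproctitle==1.3.2 matplotlib==3.7.2 tqdm==4.66.1 scikit-learn==1.3.2 scipy==1.11.2 torch==2.0.1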
cd NumericalComputation/Figure4/
python BayesianSimulation_Preprocess.py
One can reduce the sample size "K = 20000" for the Monte Carlo simulation in the code to accelerate the process, though this will likely result in increased variance.
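As a quick illustration of this trade-off (a standalone sketch, not code from the repository), the standard error of a Monte Carlo average shrinks roughly as 1/sqrt(K), so cutting K from 20000 to 2000 increases the spread of the estimate by roughly a factor of 3:
# Standalone illustration: Monte Carlo standard error scales as 1/sqrt(K),
# so a smaller K runs faster but gives noisier estimates.
import numpy as np
rng = np.random.default_rng(0)
for K in [2000, 20000]:
    # Repeat the same K-sample Monte Carlo estimate many times and look at its spread.
    estimates = [rng.normal(size=K).mean() for _ in range(200)]
    print(f"K={K:6d}  std of the estimate ~ {np.std(estimates):.4f}  (theory: {1/np.sqrt(K):.4f})")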
Download and unzip the corresponding .zip file from the Dropbox link.
Then run
python BayesianSimulation_Visualize.py
to get Figure4.pdf.
cd NumericalComputation/Figure5/
python 5.1.2_Preprocess.py
One can reduce the sample size "K = 80000" for the Monte Carlo simulation in the code to accelerate the process, though this will likely result in increased variance.
Download and unzip the corresponding .zip file from the Dropbox link.
Then run
python 5.1.2_Visualize.py
to get Figure5.pdf.
cd NumericalComputation/Figure5/
python EarlyAscent_Preprocess.py
One can reduce the sample size "K = 10000" for the Monte Carlo simulation in the code to accelerate the process, though this will likely result in increased variance. Note: The code takes a long time to run since it loops through these parameters: d_list = [1,2,3,5,8] and demon_list = [0,1,2,4,8,16,32,64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072].
Download and unzip the corresponding .zip file from the Dropbox link.
Then run
python EarlyAscent_Visualize.py
to get Figure6.pdf.
cd RealWorldLLMExperiment/Table1/
vi call_openai.py
Replace the string "yourkey" in the code with your OpenAI API key.
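If you would rather not hardcode the key, one common alternative (illustrative only; call_openai.py in this repository may use a different interface or an older version of the openai package) is to read it from an environment variable:
# Illustrative only: read the OpenAI API key from the OPENAI_API_KEY
# environment variable instead of hardcoding it (assumes openai>=1.0).
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])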
To evaluate with k in-context examples (for instance, k = 4), run:
python Ushape.py --k 4
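To sweep over several numbers of in-context examples, repeat the command for each value of k; the values below are only an example, and the exact set used in the paper should be taken from Table 1:
python Ushape.py --k 1
python Ushape.py --k 2
python Ushape.py --k 4
python Ushape.py --k 8
python Ushape.py --k 16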
Note: In the following scripts, inference for llama2, mistral, and mixtral is based on vLLM. You will need at least 4x A100 GPUs to run the largest models, including mixtral and llama-2-70b-hf.
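For reference, tensor-parallel inference with vLLM across 4 GPUs typically looks like the minimal sketch below (a placeholder prompt and a Hugging Face model identifier are used here for illustration; this is not the repository's exact code):
# Minimal vLLM sketch (not the repository's exact code): tensor-parallel
# inference over 4 GPUs for a large model such as Mixtral.
from vllm import LLM, SamplingParams
llm = LLM(model="mistralai/Mixtral-8x7B-v0.1", tensor_parallel_size=4)
params = SamplingParams(temperature=0.0, max_tokens=16)
outputs = llm.generate(["Translate 'bonjour' to English:"], params)
print(outputs[0].outputs[0].text)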
cd RealWorldLLMExperiment/Figure8/
vi call_openai.py
Replace the string "yourkey" in the code with your OpenAI API key.
python test_gpt4.py
python test_llama-2-13b-hf.py
python test_llama-2-70b-hf.py
python test_mistral.py
python test_mixtral.py
Download and unzip the corresponding .zip file from the Dropbox link.
After completing the steps above, run:
python ZeroICL.py
The following experiments can be run on a single RTX 4090 GPU.
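Before launching, you can confirm that PyTorch detects the GPU with a quick check such as:
# Quick sanity check that PyTorch can see a CUDA GPU.
import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no CUDA device found")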
cd TransformerExperiment/
python TS_Regular4_delta_run.py
Download and unzip the corresponding .zip file from the Dropbox link.
python TS_Regular4_delta_visual.py
python TS_RegularM_run.py
Download and unzip the corresponding .zip file from the Dropbox link.
python TS_RegularM_visual.py
python TS_D_d_run.py
Download and unzip the corresponding .zip file from the Dropbox link.
python TS_D_d_visual.py