diff --git a/_posts/2023-03-19-ChatGPT.md b/_posts/2023-03-19-ChatGPT.md
index 4c405ebf4210..1ab31038d17a 100644
--- a/_posts/2023-03-19-ChatGPT.md
+++ b/_posts/2023-03-19-ChatGPT.md
@@ -89,12 +89,12 @@ GPT has been a major breakthrough in natural language processing and the version
 
 ### Generative pre-training
 
-The term _generative pre-training_ represents the unsupervised pre-training of the generative model.They used a multi-layer Transformer decoder to produce an output distribution over target tokens. Given an unsupervised corpus of tokens $\mathcal{U} = \{u_1,\dots,u_n\}$, they use a standard language modelling objective to maximize the following likelihood:
+The term _generative pre-training_ represents the unsupervised pre-training of the generative model. They used a multi-layer Transformer decoder to produce an output distribution over target tokens. Given an unsupervised corpus of tokens \(\mathcal{U} = \{u_1,\dots,u_n\}\), they use a standard language modelling objective to maximize the following likelihood:
 {: .text-justify}
 
-\begin{equation}
+\[
 L_1(\mathcal{U})=\sum_i\log P(u_i|u_{i-k},\dots,u_{i-1};\Theta)
-\end{equation}
+\]
 
 where $k$ is the size of the context window, and the conditional probability $P$ is modelled using a neural network with parameters $\Theta$ trained using stochastic gradient descent. **Intuitively, we train the Transformer-based model to predict the next token within the $k$-context window using unlabeled text from which we also extract the latent features $h$.**
 {: .text-justify}
@@ -104,9 +104,32 @@ where $k$ is the size of the context window, and the conditional probability $P$
 
 After training the model with the objective function above, they adapt the parameters to the supervised target task which refers to supervised fine-tuning. Assume a labelled dataset $\mathcal{C}$, where each instance consists of a sequence of input tokens, $x^1,\dots, x^m$, along with a label $y$. The inputs are passed through the pre-trained model to obtain the final transformer block's activation $h_l^m$, which is then fed into an added linear output layer with parameters $W_y$ to predict $y$:
 {: .text-justify}
 
-$$
+\[
 P(y|x^1,\dots,x^m)=softmax(h_l^mW_y).
-$$
+\]
 
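+To make the added output layer concrete, here is a minimal PyTorch sketch (an illustration under assumptions, not the paper's actual implementation; `hidden_dim` and `num_labels` are hypothetical dimensions): the final transformer block's activation is fed into a single linear layer $W_y$ followed by a softmax.
+{: .text-justify}
+
+```python
+import torch
+import torch.nn as nn
+
+class ClassificationHead(nn.Module):
+    """Linear output layer added on top of the pre-trained transformer.
+
+    Maps the final transformer block's activation h_l^m to a
+    distribution over labels via softmax(h_l^m W_y).
+    """
+    def __init__(self, hidden_dim: int, num_labels: int):
+        super().__init__()
+        # W_y from the formula above; bias omitted to mirror it exactly.
+        self.W_y = nn.Linear(hidden_dim, num_labels, bias=False)
+
+    def forward(self, h_lm: torch.Tensor) -> torch.Tensor:
+        # h_lm: (batch, hidden_dim) activation at the last input token.
+        return torch.softmax(self.W_y(h_lm), dim=-1)
+```
+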
 This gives us the following objective to maximize:
@@ -134,7 +157,25 @@ Some tasks, like question answering or textual entailment, have structured inputs
 Few-Shot is the term referring to the setting where the model is given a few demonstrations of the task at inference time as conditioning, but no weight updates are allowed. _Few-shot learning_ involves learning based on a broad distribution of tasks (in this case implicit in the pre-training data) and then rapidly adapting to a new task. The primary goal in traditional Few-Shot frameworks is to learn a similarity function that can map the similarities between the classes in the support and query sets.
 {: .text-justify}
 
-Figure 2.1 illustrates different settings, from which we see for a typical dataset an example has a context and a desired completion (for example an English sentence and the French translation), and few-shot works by giving $$K$$ examples of context and completion, and then one final example of context, with the model expected to provide the completion.
+Figure 2.1 illustrates the different settings: in a typical dataset, an example consists of a context and a desired completion (for example, an English sentence and its French translation). Few-shot works by giving $K$ examples of context and completion, followed by one final example of context, for which the model is expected to provide the completion.
 {: .text-justify}
 
 The main advantages of few-shot are a major reduction in the need for task-specific data and a reduced potential to learn an overly narrow distribution from a large but narrow fine-tuning dataset. The main disadvantage is that results from this method have so far been much worse than state-of-the-art fine-tuned models. Also, a small amount of task-specific data is still required.
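+
+To make the few-shot setting concrete, here is a minimal Python sketch of assembling a $K$-shot prompt (the `build_few_shot_prompt` helper, the `=>` separator, and the word pairs are illustrative assumptions, not from the paper); the demonstrations serve purely as conditioning, and no weights are updated.
+{: .text-justify}
+
+```python
+def build_few_shot_prompt(demonstrations, query):
+    """Assemble a K-shot prompt: K (context, completion) pairs,
+    then one final context for the model to complete."""
+    lines = [f"{context} => {completion}" for context, completion in demonstrations]
+    lines.append(f"{query} =>")
+    return "\n".join(lines)
+
+# Two English-to-French demonstrations, as in the translation example above.
+print(build_few_shot_prompt([("cheese", "fromage"), ("house", "maison")], "cat"))
+# cheese => fromage
+# house => maison
+# cat =>
+```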