
Commit

modified math
Demi-wlw committed Jul 28, 2024
1 parent 0bf0989 commit 4557403
Showing 1 changed file with 6 additions and 6 deletions.
12 changes: 6 additions & 6 deletions _posts/2023-03-19-ChatGPT.md
@@ -89,12 +89,12 @@ GPT has been a major breakthrough in natural language processing and the version

### Generative pre-training

-The term _generative pre-training_ represents the unsupervised pre-training of the generative model.<d-footnote>They used a multi-layer Transformer decoder to produce an output distribution over target tokens.</d-footnote> Given an unsupervised corpus of tokens $\mathcal{U} = \{u_1,\dots,u_n\}$, they use a standard language modelling objective to maximize the following likelihood:
+The term _generative pre-training_ represents the unsupervised pre-training of the generative model.<d-footnote>They used a multi-layer Transformer decoder to produce an output distribution over target tokens.</d-footnote> Given an unsupervised corpus of tokens \(\mathcal{U} = \{u_1,\dots,u_n\}\), they use a standard language modelling objective to maximize the following likelihood:
{: .text-justify}

-\begin{equation}
+\[
L_1(\mathcal{U})=\sum_i\log P(u_i|u_{i-k},\dots,u_{i-1};\Theta)
-\end{equation}
+\]

where $k$ is the size of the context window, and the conditional probability $P$ is modelled using a neural network with parameters $\Theta$ trained using stochastic gradient descent. **Intuitively, we train the Transformer-based model to predict the next token within the $k$-context window using unlabeled text from which we also extract the latent features $h$.**
{: .text-justify}
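Not part of the commit, but as a quick illustration of the objective \(L_1\) being edited here: a pure-Python sketch in which a hand-coded conditional table stands in for the Transformer decoder (all names, tokens, and probabilities below are illustrative, not from the post):

```python
import math

# Toy stand-in for the Transformer: P(next token | k-token context).
# In GPT this distribution comes from the decoder with parameters Theta.
COND = {
    ("the",): {"cat": 0.6, "dog": 0.4},
    ("cat",): {"sat": 0.9, "ran": 0.1},
}

def log_likelihood(tokens, k=1):
    # L1(U) = sum_i log P(u_i | u_{i-k}, ..., u_{i-1}; Theta)
    total = 0.0
    for i in range(k, len(tokens)):
        context = tuple(tokens[i - k:i])
        total += math.log(COND[context][tokens[i]])
    return total

ll = log_likelihood(["the", "cat", "sat"], k=1)  # log 0.6 + log 0.9
```

Maximizing this sum over an unlabeled corpus is what trains the model to predict the next token within its \(k\)-token context window.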
@@ -104,9 +104,9 @@ where $k$ is the size of the context window, and the conditional probability $P$
After training the model with the objective function above, they adapt the parameters to the supervised target task which refers to supervised fine-tuning. Assume a labelled dataset $\mathcal{C}$, where each instance consists of a sequence of input tokens, $x^1,\dots, x^m$, along with a label $y$. The inputs are passed through the pre-trained model to obtain the final transformer block's activation $h_l^m$, which is then fed into an added linear output layer with parameters $W_y$ to predict $y$:
{: .text-justify}

-$$
+\[
P(y|x^1,\dots,x^m)=softmax(h_l^mW_y).
-$$
+\]
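As an aside, not part of the commit: the added linear output layer and softmax in the equation above can be sketched in plain Python (toy dimensions and weights, all illustrative):

```python
import math

def softmax(z):
    m = max(z)                      # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def predict(h, W_y):
    # P(y | x^1, ..., x^m) = softmax(h_l^m W_y); h is 1 x d, W_y is d x classes
    logits = [sum(h[i] * W_y[i][j] for i in range(len(h)))
              for j in range(len(W_y[0]))]
    return softmax(logits)

h = [1.0, 2.0]                      # final transformer block's activation h_l^m
W_y = [[0.5, -0.5],
       [0.3, 0.1]]                  # added output layer parameters W_y
probs = predict(h, W_y)             # a distribution over labels y
```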

This gives us the following objective to maximize:

@@ -134,7 +134,7 @@ Some tasks, like question answering or textual entailment, have structured inputs
Few-Shot is the term referring to the setting where the model is given a few demonstrations of the task at inference time as conditioning, but no weight updates are allowed. _Few-shot learning_ involves learning based on a broad distribution of tasks (in this case implicit in the pre-training data) and then rapidly adapting to a new task. The primary goal in traditional Few-Shot frameworks is to learn a similarity function that can map the similarities between the classes in the support and query sets.
{: .text-justify}

-Figure 2.1 <d-cite key="GPT-3"></d-cite> illustrates different settings, from which we see for a typical dataset an example has a context and a desired completion (for example an English sentence and the French translation), and few-shot works by giving $$K$$ examples of context and completion, and then one final example of context, with the model expected to provide the completion.
+Figure 2.1 <d-cite key="GPT-3"></d-cite> illustrates different settings, from which we see for a typical dataset an example has a context and a desired completion (for example an English sentence and the French translation), and few-shot works by giving $K$ examples of context and completion, and then one final example of context, with the model expected to provide the completion.
{: .text-justify}

The main advantages of few-shot are a major reduction in the need for task-specific data and a reduced potential to learn an overly narrow distribution from a large but narrow fine-tuning dataset. The main disadvantage is that results from this method have so far been much worse than state-of-the-art fine-tuned models. Also, a small amount of task-specific data is still required.
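Not part of the commit, but the few-shot setting described above is easy to make concrete: $K$ demonstrations of context and completion are concatenated with one final context into a single prompt, and no weights are updated (the prompt format and translation pairs below are illustrative):

```python
def few_shot_prompt(demos, query):
    # Build a K-shot prompt: K (context, completion) pairs, then the final
    # context. The model conditions on this at inference time; no weight
    # updates occur.
    lines = [f"{ctx} => {completion}" for ctx, completion in demos]
    lines.append(f"{query} =>")
    return "\n".join(lines)

demos = [("sea otter", "loutre de mer"),   # K = 2 demonstrations
         ("cheese", "fromage")]
prompt = few_shot_prompt(demos, "mint")    # model is expected to complete this
```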
