
Commit

modified math
Demi-wlw committed Jul 28, 2024
1 parent 0bf0989 commit 4557403
Showing 1 changed file with 6 additions and 6 deletions.
12 changes: 6 additions & 6 deletions _posts/2023-03-19-ChatGPT.md
@@ -89,12 +89,12 @@ GPT has been a major breakthrough in natural language processing and the version

### Generative pre-training

-The term _generative pre-training_ represents the unsupervised pre-training of the generative model.<d-footnote>They used a multi-layer Transformer decoder to produce an output distribution over target tokens.</d-footnote> Given an unsupervised corpus of tokens $\mathcal{U} = \{u_1,\dots,u_n\}$, they use a standard language modelling objective to maximize the following likelihood:
+The term _generative pre-training_ represents the unsupervised pre-training of the generative model.<d-footnote>They used a multi-layer Transformer decoder to produce an output distribution over target tokens.</d-footnote> Given an unsupervised corpus of tokens \(\mathcal{U} = \{u_1,\dots,u_n\}\), they use a standard language modelling objective to maximize the following likelihood:
{: .text-justify}

-\begin{equation}
+\[
L_1(\mathcal{U})=\sum_i\log P(u_i|u_{i-k},\dots,u_{i-1};\Theta)
-\end{equation}
+\]

where $k$ is the size of the context window, and the conditional probability $P$ is modelled using a neural network with parameters $\Theta$ trained using stochastic gradient descent. **Intuitively, we train the Transformer-based model to predict the next token within the $k$-context window using unlabeled text from which we also extract the latent features $h$.**
{: .text-justify}
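Not part of the commit, but as a quick illustration of the objective \(L_1\) being edited here: a pure-Python sketch in which a hand-coded conditional table stands in for the Transformer decoder (all names, tokens, and probabilities below are illustrative, not from the post):

```python
import math

# Toy stand-in for the Transformer: P(next token | k-token context).
# In GPT this distribution comes from the decoder with parameters Theta.
COND = {
    ("the",): {"cat": 0.6, "dog": 0.4},
    ("cat",): {"sat": 0.9, "ran": 0.1},
}

def log_likelihood(tokens, k=1):
    # L1(U) = sum_i log P(u_i | u_{i-k}, ..., u_{i-1}; Theta)
    total = 0.0
    for i in range(k, len(tokens)):
        context = tuple(tokens[i - k:i])
        total += math.log(COND[context][tokens[i]])
    return total

ll = log_likelihood(["the", "cat", "sat"], k=1)  # log 0.6 + log 0.9
```

Maximizing this sum over an unlabeled corpus is what trains the model to predict the next token within its \(k\)-token context window.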
@@ -104,9 +104,9 @@ where $k$ is the size of the context window, and the conditional probability $P$
After training the model with the objective function above, they adapt the parameters to the supervised target task which refers to supervised fine-tuning. Assume a labelled dataset $\mathcal{C}$, where each instance consists of a sequence of input tokens, $x^1,\dots, x^m$, along with a label $y$. The inputs are passed through the pre-trained model to obtain the final transformer block's activation $h_l^m$, which is then fed into an added linear output layer with parameters $W_y$ to predict $y$:
{: .text-justify}

-$$
+\[
P(y|x^1,\dots,x^m)=softmax(h_l^mW_y).
-$$
+\]
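As an aside, not part of the commit: the added linear output layer and softmax in the equation above can be sketched in plain Python (toy dimensions and weights, all illustrative):

```python
import math

def softmax(z):
    m = max(z)                      # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def predict(h, W_y):
    # P(y | x^1, ..., x^m) = softmax(h_l^m W_y); h is 1 x d, W_y is d x classes
    logits = [sum(h[i] * W_y[i][j] for i in range(len(h)))
              for j in range(len(W_y[0]))]
    return softmax(logits)

h = [1.0, 2.0]                      # final transformer block's activation h_l^m
W_y = [[0.5, -0.5],
       [0.3, 0.1]]                  # added output layer parameters W_y
probs = predict(h, W_y)             # a distribution over labels y
```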

This gives us the following objective to maximize:

@@ -134,7 +134,7 @@ Some tasks, like question answering or textual entailment, have structured inputs
Few-Shot is the term referring to the setting where the model is given a few demonstrations of the task at inference time as conditioning, but no weight updates are allowed. _Few-shot learning_ involves learning based on a broad distribution of tasks (in this case implicit in the pre-training data) and then rapidly adapting to a new task. The primary goal in traditional Few-Shot frameworks is to learn a similarity function that can map the similarities between the classes in the support and query sets.
{: .text-justify}

-Figure 2.1 <d-cite key="GPT-3"></d-cite> illustrates different settings, from which we see for a typical dataset an example has a context and a desired completion (for example an English sentence and the French translation), and few-shot works by giving $$K$$ examples of context and completion, and then one final example of context, with the model expected to provide the completion.
+Figure 2.1 <d-cite key="GPT-3"></d-cite> illustrates different settings, from which we see for a typical dataset an example has a context and a desired completion (for example an English sentence and the French translation), and few-shot works by giving $K$ examples of context and completion, and then one final example of context, with the model expected to provide the completion.
{: .text-justify}

The main advantages of few-shot are a major reduction in the need for task-specific data and a reduced potential to learn an overly narrow distribution from a large but narrow fine-tuning dataset. The main disadvantage is that results from this method have so far been much worse than state-of-the-art fine-tuned models. Also, a small amount of task-specific data is still required.
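Not part of the commit, but the few-shot setting described above is easy to make concrete: $K$ demonstrations of context and completion are concatenated with one final context into a single prompt, and no weights are updated (the prompt format and translation pairs below are illustrative):

```python
def few_shot_prompt(demos, query):
    # Build a K-shot prompt: K (context, completion) pairs, then the final
    # context. The model conditions on this at inference time; no weight
    # updates occur.
    lines = [f"{ctx} => {completion}" for ctx, completion in demos]
    lines.append(f"{query} =>")
    return "\n".join(lines)

demos = [("sea otter", "loutre de mer"),   # K = 2 demonstrations
         ("cheese", "fromage")]
prompt = few_shot_prompt(demos, "mint")    # model is expected to complete this
```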
