diff --git a/_posts/2023-03-19-ChatGPT.md b/_posts/2023-03-19-ChatGPT.md
index 4c405ebf4210..1ab31038d17a 100644
--- a/_posts/2023-03-19-ChatGPT.md
+++ b/_posts/2023-03-19-ChatGPT.md
@@ -89,12 +89,12 @@ GPT has been a major breakthrough in natural language processing and the version
### Generative pre-training
-The term _generative pre-training_ represents the unsupervised pre-training of the generative model.They used a multi-layer Transformer decoder to produce an output distribution over target tokens. Given an unsupervised corpus of tokens $\mathcal{U} = \{u_1,\dots,u_n\}$, they use a standard language modelling objective to maximize the following likelihood:
+The term _generative pre-training_ refers to the unsupervised pre-training of the generative model. They used a multi-layer Transformer decoder to produce an output distribution over target tokens. Given an unsupervised corpus of tokens \(\mathcal{U} = \{u_1,\dots,u_n\}\), they use a standard language modelling objective to maximize the following likelihood:
{: .text-justify}
-\begin{equation}
+\[
L_1(\mathcal{U})=\sum_i\log P(u_i|u_{i-k},\dots,u_{i-1};\Theta)
-\end{equation}
+\]
where $k$ is the size of the context window, and the conditional probability $P$ is modelled using a neural network with parameters $\Theta$ trained using stochastic gradient descent. **Intuitively, we train the Transformer-based model to predict the next token within the $k$-context window using unlabeled text from which we also extract the latent features $h$.**
{: .text-justify}
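The objective \(L_1(\mathcal{U})\) above can be sketched in a few lines. This is a minimal illustration, not the actual Transformer: `log_prob` is a hypothetical stand-in for the parameterized model \(P(u_i \mid u_{i-k},\dots,u_{i-1};\Theta)\), and a uniform distribution over the vocabulary is used as a placeholder.

```python
import numpy as np

def lm_log_likelihood(tokens, k, log_prob):
    """Compute L1(U) = sum_i log P(u_i | u_{i-k}, ..., u_{i-1})."""
    total = 0.0
    for i in range(len(tokens)):
        context = tokens[max(0, i - k):i]  # k-token context window
        total += log_prob(tokens[i], context)
    return total

# Placeholder model: uniform over a vocabulary of size V
# (in GPT, this would be the Transformer decoder's output distribution).
V = 50
uniform_log_prob = lambda token, context: -np.log(V)

tokens = [3, 17, 42, 8, 25]
print(lm_log_likelihood(tokens, k=2, log_prob=uniform_log_prob))
```

Training maximizes this quantity over the corpus via stochastic gradient descent on \(\Theta\); in practice one minimizes the equivalent negative log-likelihood (cross-entropy) loss.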
@@ -104,9 +104,9 @@ where $k$ is the size of the context window, and the conditional probability $P$
After training the model with the objective function above, they adapt the parameters to the supervised target task which refers to supervised fine-tuning. Assume a labelled dataset $\mathcal{C}$, where each instance consists of a sequence of input tokens, $x^1,\dots, x^m$, along with a label $y$. The inputs are passed through the pre-trained model to obtain the final transformer block's activation $h_l^m$, which is then fed into an added linear output layer with parameters $W_y$ to predict $y$:
{: .text-justify}
-$$
+\[
P(y|x^1,\dots,x^m)=softmax(h_l^mW_y).
-$$
+\]
This gives us the following objective to maximize:
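The fine-tuning head can be sketched as a single linear layer followed by a softmax over the final activation. The dimensions and random values below are illustrative assumptions, not values from the paper; `h_l_m` stands in for the final transformer block's activation \(h_l^m\).

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, n_classes = 8, 3          # hypothetical sizes for illustration

h_l_m = rng.normal(size=d_model)            # final block activation h_l^m
W_y = rng.normal(size=(d_model, n_classes)) # added output layer parameters

p = softmax(h_l_m @ W_y)  # P(y | x^1, ..., x^m)
print(p)
```

During fine-tuning, \(W_y\) (and optionally the pre-trained parameters) are updated to maximize the log-probability of the correct label \(y\).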
@@ -134,7 +134,7 @@ Some tasks, like question answering or textual entailment, have structured input
Few-Shot is the term referring to the setting where the model is given a few demonstrations of the task at inference time as conditioning, but no weight updates are allowed. _Few-shot learning_ involves learning based on a broad distribution of tasks (in this case implicit in the pre-training data) and then rapidly adapting to a new task. The primary goal in traditional Few-Shot frameworks is to learn a similarity function that can map the similarities between the classes in the support and query sets.
{: .text-justify}
-Figure 2.1 illustrates different settings, from which we see for a typical dataset an example has a context and a desired completion (for example an English sentence and the French translation), and few-shot works by giving $$K$$ examples of context and completion, and then one final example of context, with the model expected to provide the completion.
+Figure 2.1 illustrates the different settings. In a typical dataset, an example has a context and a desired completion (for example, an English sentence and its French translation); few-shot works by giving $K$ examples of context and completion, followed by one final context, for which the model is expected to provide the completion.
{: .text-justify}
The main advantages of few-shot are a major reduction in the need for task-specific data and a reduced potential to learn an overly narrow distribution from a large but narrow fine-tuning dataset. The main disadvantage is that results from this method have so far been much worse than state-of-the-art fine-tuned models. Also, a small amount of task-specific data is still required.
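Constructing such a few-shot prompt amounts to concatenating the $K$ demonstrations with the final context. A minimal sketch, using the English-to-French example above (the `=>` separator and the helper name are arbitrary choices for illustration):

```python
def build_few_shot_prompt(examples, query, instruction="Translate English to French:"):
    """Build a K-shot prompt: K (context, completion) demonstrations,
    then one final context for which the model must supply the completion."""
    lines = [instruction]
    for context, completion in examples:
        lines.append(f"{context} => {completion}")
    lines.append(f"{query} =>")  # completion left for the model to fill in
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],  # K = 2 demonstrations
    "peppermint",
)
print(prompt)
```

Note that no weight updates occur: the demonstrations condition the model purely through the input at inference time.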