When I started studying the Transformer model, I found that some important implementation details were not entirely clear, and I had to search for other implementations or explanations of them.
For this reason, I decided to write down clarifications for the most important doubts I had, hoping that this could help other researchers!
These explanations assume a basic knowledge of transformer models (e.g. Encoder-Decoder architecture, Multi-Head Attention Mechanism, tokenization, etc.), so as not to create yet another redundant repository on top of the millions already on the web. In this way, I can focus specifically on the ambiguities.
This Repo offers:
2. A complete, clear, and commented implementation of the Transformer model in PyTorch and PyTorch Lightning.
The very well-known image that depicts the transformer architecture hides a lot of important information that is useful for the correct implementation.
Some of the first questions that came up in my mind when I had a look at this picture were:
The encoder and the decoder can have multiple layers (N as reported). The output of the encoder seems to be connected to the decoder. But! Into which layer?? The last one, the first one?? All of them??
As reported in:
Picture taken from (https://www.truefoundry.com/blog/transformer-architecture)
Every attention block has three inputs that should be Query, Key, and Value. Which one is what??
The Keys and the Values come from the Encoder, while the Queries come from the previous sublayer of the decoder (the masked self-attention one).
Both the above answers could be extracted with a bit of interpretation from:
Notice the phrase: "This allows every position in the decoder to attend over all the positions in the input sequence". This sentence will also be useful later.
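A minimal sketch of this wiring, under the assumption that the output of the last encoder layer is what feeds the cross-attention of every decoder layer. It uses `torch.nn.MultiheadAttention`; the sizes and the names `memory` and `cross_attn_layers` are illustrative, and the self-attention, feed-forward and normalization sublayers are left out on purpose.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_heads, n_layers = 16, 4, 3

memory = torch.randn(1, 5, d_model)  # encoder output: (batch, src_len, d_model)
x = torch.randn(1, 3, d_model)       # decoder-side activations: (batch, tgt_len, d_model)

# One cross-attention module per decoder layer (other sublayers omitted)
cross_attn_layers = nn.ModuleList(
    [nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(n_layers)]
)

for cross_attn in cross_attn_layers:
    # Queries come from the decoder; Keys and Values are the same encoder output in every layer
    attn_out, attn_weights = cross_attn(query=x, key=memory, value=memory)
    x = x + attn_out  # residual connection (layer normalization omitted)

print(x.shape)             # one vector per decoder position
print(attn_weights.shape)  # (batch, tgt_len, src_len)
```

Every decoder position gets a weight for every encoder position, which is exactly what the quoted sentence describes.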
In the rest of the README we'll call:
- Self-Attention block of the encoder: the attention block of the encoder (of course :) )
- Masked-Self-Attention block of the decoder: you got it!
- Cross-Attention block: the block where the encoder is connected to the decoder.
A more detailed answer comes later!
I admit that I struggled a bit to understand how masking is used in this model, mainly because a looot of things are taken for granted, and become clear and obvious only when you start implementing and problems come up.
First of all, I would have named the "Look Ahead Mask" the "DON'T Look Ahead Mask". This mask is used by the decoder so that attention is computed only backward in the sentence.
Yes, it makes sense, but why?? Well, because at inference time the decoder acts in an auto-regressive manner: only the encoder's input is available as a complete sentence, while the decoder has to generate one word at a time, using only the words it has already generated. For this reason, at training time we need to force the model to learn to predict the ground-truth output sentence without looking at the next words, otherwise, that's cheating!
Here we report the shape of the "Don't look ahead mask" also called "Causal Mask":
Notice that the size of the mask is
The matrix is composed of zeros and
Notice the mask is inside the softmax function.
This is done because if we consider
Now, the
Remember that
Hence, when the value is
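A minimal sketch of how such a causal mask could be built and added to the scores before the softmax. The sequence length and the all-zero `scores` are placeholders chosen only for illustration, not values from the repo.

```python
import torch

L = 4  # sequence length (illustrative)

# -inf strictly above the diagonal, 0 on and below it
causal_mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
print(causal_mask)

scores = torch.zeros(L, L)  # placeholder for Q K^T / sqrt(d_k)
weights = torch.softmax(scores + causal_mask, dim=-1)
print(weights)  # row i spreads its weight only over positions 0..i
```

Since exp(-inf) is 0, the masked positions receive exactly zero attention weight, while the unmasked ones are left untouched by the added zeros.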
With an example everything is always clearer!
That of course is symmetric. Moreover, we have that
Now we need to apply the softmax function ROW-WISE. Why row-wise? Because, remember, we are using column vectors:
1. The softmax function is numerically fragile with $-\infty$ : a row that is entirely $-\infty$ produces NaN. For this reason, we replace the $-\infty$ values with a VERY LARGE NEGATIVE VALUE like -1E15;
This could be trivial for practitioners, but it's important to spell everything out (the repo is called TransformerForDummies after all :D)
First of all, remember what the "dimensions" mean in PyTorch for a 2D tensor: dim = 0 means that you are indexing through the rows! dim = 1 means that you are indexing through the columns.
However, the Pytorch documentation of the softmax function reports:
That in this case means that every row will be "collapsed" independently to compute the softmax. Hence, after the:
```python
values = torch.softmax(values, dim=-1)
```
Using the last dimension! That in our case means the whole rows!
We'll have:
The sum "for each row" is always 1.0, try to believe!
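If you don't want to just believe it, here is a small check. The numbers are made up for illustration; the second part shows why a large negative constant is a safer choice than $-\infty$ when a whole row ends up masked.

```python
import torch

scores = torch.tensor([[1.0, 2.0, 3.0],
                       [1.0, 1.0, 1.0]])

weights = torch.softmax(scores, dim=-1)  # normalize each row independently
row_sums = weights.sum(dim=-1)
col_sums = weights.sum(dim=0)
print(row_sums)  # every row sums to 1
print(col_sums)  # the columns, in general, do not

# A fully masked row: -inf everywhere gives NaN, a large negative constant does not
all_inf = torch.softmax(torch.full((1, 3), float("-inf")), dim=-1)
all_big = torch.softmax(torch.full((1, 3), -1e15), dim=-1)
print(all_inf)
print(all_big)
```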
Finally, we can compute the output values of the attention mechanism:
The result is:
This new vector represents a weighted combination of the values of
The Padding mask could seem trivial at first sight, but it has its own quirks. The first reason why it is necessary is that not all the sentences have the same length! We:
- Add Padding tokens to bring all the sentences to the same length;
- Create a mask that "blocks" the softmax function from considering these uninformative tokens.
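A minimal sketch of these two steps, assuming the sentences are already tokenized into integer ids and that id `0` is reserved for padding (both are assumptions made for this example). `pad_sequence` is the PyTorch utility from `torch.nn.utils.rnn`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD = 0  # hypothetical id reserved for the padding token

sentences = [torch.tensor([5, 8, 3]),
             torch.tensor([7, 2, 9, 4, 6]),
             torch.tensor([1, 4])]

# Step 1: pad every sentence to the length of the longest one
batch = pad_sequence(sentences, batch_first=True, padding_value=PAD)

# Step 2: mark the uninformative positions (True = padding)
is_padding = batch == PAD

print(batch)
print(is_padding)
```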
2) Wait? But the encoder's input and the decoder's input can have different lengths? What about the padding then?
Let's assume that we have the batch size equal to 1, the encoder output is
First of all, the
About the two sequence lengths instead, remember from answer 2 that the decoder provides the queries to the attention, while the encoder provides the keys and the values. Hence,
This first explains why the embedding size should be equal for both the encoder and the decoder (basic linear algebra).
Then, after the attention computation:
where the subscripts
From a practical point of view though, we need to understand when having different lengths is convenient, necessary, or neither:
- Training:
  - During training, the batch size is larger than 1, so the padding IS NECESSARY;
  - In theory, it is also possible to create batches of different lengths for the encoder and the decoder (sequence lengths, not the batch size of course). This can be annoying from the implementation point of view, but it could be convenient if there is a large difference between the sequence lengths of the two languages (if we consider a translation task);
  - In practice, during training the dataloader is often implemented using the same length for the encoder's and decoder's inputs.
- Inference:
  - At inference time (manually testing the model, for example) we often use just one input; in this case we don't need the padding, since the batch size = 1.
  - On the other hand, if we implemented the model in such a way that the encoder's input and the decoder's input can have different sizes, we don't even need the padding for the input.
Recap:
- The padding is used for two reasons:
  - Aligning the sequences within the same batch;
  - Aligning the sequences between the two batches of encoder and decoder (depends on the implementation).
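The shape argument above can be checked with a few lines. This is only a sketch with random tensors and made-up sizes: the encoder and decoder sequence lengths differ, the embedding size is shared.

```python
import torch

torch.manual_seed(0)
B, d = 1, 8           # batch size and shared embedding size
T_enc, T_dec = 6, 4   # different sequence lengths for encoder and decoder

Q = torch.randn(B, T_dec, d)  # queries from the decoder
K = torch.randn(B, T_enc, d)  # keys from the encoder
V = torch.randn(B, T_enc, d)  # values from the encoder

scores = Q @ K.transpose(-1, -2) / d ** 0.5  # (B, T_dec, T_enc)
weights = torch.softmax(scores, dim=-1)      # each decoder position sums to 1 over encoder positions
out = weights @ V                            # (B, T_dec, d)

print(scores.shape, out.shape)
```

The output has one vector per decoder position, whatever the encoder length is, so the two inputs do not need to be padded to a common length for the algebra to work.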
First, if we want to talk about padding mask we need to consider the Batch size > 1 that we'll name
Now, we'll use an arbitrary value for the padding token
As an example, the "proto-padding-mask" where
Remember that the scaled-dot-product attention function with a generic mask is:
for the operation
Now, for each sentence in the set of size
Considering every element like
It's easy to see that every position in which we have a multiplication by the padding token (actually a dot product because every entry is
Hence, our padding mask for the third sentence will be:
It's easy to derive this mask with these operations:
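```python
import torch

B = 1  # batch size
L = 6  # sequence length

# True marks a padding position; a boolean tensor is needed for the | operator below
padding_mask = torch.tensor([False, False, False, False, True, True]).unsqueeze(0).unsqueeze(0)  # (B, 1, L)
padding_mask_right = padding_mask.repeat(1, L, 1)         # (B, L, L): padded columns
padding_mask_left = padding_mask_right.transpose(-1, -2)  # (B, L, L): padded rows
padding_mask = (padding_mask_left | padding_mask_right).float()
padding_mask[padding_mask == 1.] = -torch.inf
```
but I'm pretty sure more efficient ways exist.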
It's also important to notice from the implementation that the padding mask is effectively composed of two masks. This is because
Hence, we'll have a different Padding mask for each sentence.
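One possible way to build the same kind of mask for a whole batch at once is broadcasting. This is a sketch and not the repo's implementation; the token ids and `PAD = 0` are assumptions for the example.

```python
import torch

PAD = 0  # hypothetical padding id
tokens = torch.tensor([[4, 9, 2, 7, PAD, PAD],
                       [3, 5, PAD, PAD, PAD, PAD]])  # (B, L)

is_pad = tokens == PAD  # (B, L), True = padding

# Broadcast (B, 1, L) against (B, L, 1): a cell is blocked if its row OR its column is padding
blocked = is_pad[:, None, :] | is_pad[:, :, None]  # (B, L, L)

# Additive mask: 0 where attention is allowed, -inf where it is blocked
padding_mask = torch.zeros(blocked.shape).masked_fill(blocked, float("-inf"))
print(padding_mask[0])
```

Each sentence in the batch gets its own (L, L) mask without any explicit `repeat`.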
This is probably one of the hardest questions I had to find an answer to. Let's start with the most trivial things. The Masked-Self-Attention block of course needs the Causal Mask, and that's ok. However, the most reasonable thing is that both the Self-attention block of the encoder and the Masked-Self-Attention block of the Decoder, also need a padding Mask. This is because as reported in the article:
- "The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder."
- "Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections. See Figure 2"
When the article mentions that the self-attention blocks should attend to "all the positions", it's reasonable to think that only the meaningful part should be attended to, so excluding the padding token. Hence, until now we have: the Encoder's Self-Attention block needs the Padding Mask; the Decoder's Masked-Self-Attention block needs padding Mask + Causal Mask.
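A sketch of how the two masks could be combined for the decoder's Masked-Self-Attention. It assumes `PAD = 0` and, to keep every row finite, masks the padded positions only as keys (columns); masking padded query rows as well is a design choice not shown here. The -1e15 constant follows the numerical-stability remark made earlier.

```python
import torch

PAD = 0  # hypothetical padding id
tgt = torch.tensor([[11, 12, 13, PAD, PAD]])  # decoder input, (B, L)
L = tgt.shape[1]

causal = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)  # (L, L), True = blocked
key_pad = (tgt == PAD)[:, None, :]                                   # (B, 1, L), True = blocked
blocked = causal | key_pad                                           # (B, L, L)

scores = torch.zeros(1, L, L)  # placeholder for Q K^T / sqrt(d_k)
weights = torch.softmax(scores.masked_fill(blocked, -1e15), dim=-1)
print(weights[0])
```

A position is attended to only if it is both not in the future and not a padding token.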
The article reports:
So, if we follow the same rationale, where "all the positions" means all the meaningful positions, do we need to combine two padding masks, the encoder's and the decoder's, also considering that the Queries come from the decoder and the Keys from the encoder?? However, since I didn't want to speculate too much, I needed to investigate more.
First of all, I found that the same question has been asked a lot around the web, but I've rarely seen a reasonable answer: HERE HERE HERE HERE HERE HERE HERE
Unfortunately, not all the answers were clear or in agreement with each other. Despite this, I tried to form my own answer, mainly based on these factors:
- The official PyTorch implementation of the Transformer model has a memory_mask parameter HERE
- This article reports that it is necessary to avoid conflict. Which conflict? Not explained.
- This instead reports that the memory mask is just the same as the encoder-input's Padding mask, so in general applied to the Keys. Ok, but why?
Ok, my take on this is:
- The Cross-Attention block needs a padding Mask;
- In the official implementations there is what is called Memory Mask that seems to be a copy of the encoder's input padding mask;
- I haven't found anything about the inclusion of the decoder's input padding mask.
However, I wasn't satisfied with this. I had to make sense of it by myself.
So, let's start with an example where queries come from the decoder, and the keys and values are the same vector from the encoder output.
Where
Now let's consider the three possibilities for the Padding mask: encoder's input Padding mask, decoder's input padding mask, and a combination of both.
More precisely, since the computation of the
Ok, now let's apply the three possibilities, and see what happens.
Where
Here I called
Finally, the combination of both the padding masks.
First!
Using the decoder's input padding mask would create dirty values. Hence, using the right encoder's input padding mask is the best choice. Not using any padding mask for the Cross-Attention block would create dirty values.
Just to experimentally validate this assertion, I trained a simple Transformer model and found that using the right padding mask for the Cross-Attention block leads to a lower validation loss than not using any (7.154 vs 7.3 validation loss after 1 epoch).
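The "dirty values" argument can also be checked numerically with a hand-written cross-attention. This is a sketch with random tensors; the function `cross_attention` and all the sizes are illustrative and not taken from the repo. The idea: if the encoder's padding mask is applied to the Keys, whatever sits at the padded encoder positions cannot influence the output.

```python
import torch

torch.manual_seed(0)
d = 8
Q = torch.randn(1, 3, d)       # decoder queries: (B, T_dec, d)
memory = torch.randn(1, 5, d)  # encoder output: (B, T_enc, d), last 2 positions are padding
enc_is_pad = torch.tensor([[False, False, False, True, True]])

def cross_attention(Q, memory, key_is_pad=None):
    """Single-head cross-attention where Keys and Values are both the encoder output."""
    scores = Q @ memory.transpose(-1, -2) / d ** 0.5  # (B, T_dec, T_enc)
    if key_is_pad is not None:
        scores = scores.masked_fill(key_is_pad[:, None, :], -1e15)
    return torch.softmax(scores, dim=-1) @ memory

# Change only what sits at the padded encoder positions
corrupted = memory.clone()
corrupted[:, 3:] += 100.0

masked_same = torch.allclose(cross_attention(Q, memory, enc_is_pad),
                             cross_attention(Q, corrupted, enc_is_pad))
unmasked_same = torch.allclose(cross_attention(Q, memory),
                               cross_attention(Q, corrupted))
print(masked_same, unmasked_same)
```

With the encoder's padding mask the two outputs coincide; without it, the content of the padded positions leaks into every decoder position.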
Where the subscripts
The embedding layers are used to map each token into a vector.
To do this, it's easy to just use the torch.nn.Embedding(num_embeddings, embedding_dim, ...) class. Internally the class is a lookup table, which is equivalent to a linear layer without bias that maps an integer into a vector once each integer is written in one-hot encoding.
Hence, the parameters will be:
- num_embeddings = VOCABULARY SIZE
- embedding_dim = EMBEDDING SIZE
Unfortunately, for this reason the embedding layer is one of the most storage-heavy parts of the model. Let's make an example:
VOCABULARY SIZE = 50k and EMBEDDING SIZE = 512, we'll have a linear layer of $50000 \times 512 = 25.6$ million parameters.
Moreover, considering that we have two different embedding layers (one for the encoder and one for the decoder), we have more than 50 million parameters just for the first step of the processing. Remember that this layer is trainable.
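A small check of both statements: that `nn.Embedding` behaves like a bias-free linear map applied to one-hot vectors, and how many parameters the sizes in the example imply. The tiny vocabulary in the first part is only there to keep the one-hot tensors small.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Small sizes to show the lookup / one-hot equivalence
small = nn.Embedding(num_embeddings=10, embedding_dim=4)
tokens = torch.tensor([[1, 7, 3]])                   # (batch, seq_len)
looked_up = small(tokens)                            # (1, 3, 4)
one_hot = F.one_hot(tokens, num_classes=10).float()  # (1, 3, 10)
via_matmul = one_hot @ small.weight                  # same vectors as the lookup
print(torch.allclose(looked_up, via_matmul))

# Sizes from the example in the text: the weight matrix is (vocab_size, emb_size)
vocab_size, emb_size = 50_000, 512
n_params = vocab_size * emb_size
print(n_params)      # 25600000
print(2 * n_params)  # 51200000, with separate encoder and decoder embeddings
```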
Even if this part is almost straightforward, in the paper it is the most ambiguous one.
It's intuitive that we just need a linear layer and a softmax to have a "vocabulary-sized" vector of probabilities to sample the most probable next word. However, let's read:
We first read
- [...], we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, [...].
Wait Wait Wait!
- I can accept weight sharing between the embedding layer of the Decoder and its last layer, because maybe we just want to save some parameters and because the vocabulary of the decoder's input is of course the same as the vocabulary of its output...
- But WHY should it also be shared with the Encoder's embedding layer, which will probably have a different vocabulary, since this model is generally used for a task like Translation??? Am I missing something? 😖
I searched a lot and I found only one sensible answer, thanks to 'noe' on Datascience.Stackexchange:
-
The source and target embeddings can be shared or not. This is a design decision. They are normally shared if the token vocabulary is shared, and this normally happens when you have languages with the same script (i.e. the Latin alphabet). If your source and target languages are e.g. English and Chinese, which have different writing systems, your token vocabularies would probably not be shared, and then the embeddings wouldn't be shared either. NOE
-
Then, the linear projection before the softmax can be shared with the target embedding matrix. This is also a design decision. It is frequent to share them. NOE
uff...okok I breathed a sigh of relief, it was as I thought, just a task-dependent design choice.
For the full answer refer HERE
Hence, my recap is:
This is actually a design choice, also made to reduce the number of parameters.
2. Encoder Embedding Layer and Decoder Embedding Layer can share the weights when the source and the target languages share the same token vocabulary.
So, in this case all three layers share the same weights, as reported in the article.
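A minimal sketch of how this weight sharing could be done in PyTorch, assuming a shared source/target vocabulary. The layer names are hypothetical; the key point is that `nn.Embedding(vocab, d).weight` and `nn.Linear(d, vocab).weight` both have shape `(vocab, d)`, so the same `Parameter` can be assigned to both.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 16

tgt_embedding = nn.Embedding(vocab_size, d_model)       # decoder input embedding
generator = nn.Linear(d_model, vocab_size, bias=False)  # pre-softmax linear layer
generator.weight = tgt_embedding.weight                 # same Parameter object, not a copy

# Optional third sharing, only meaningful when source and target share the vocabulary
src_embedding = nn.Embedding(vocab_size, d_model)
src_embedding.weight = tgt_embedding.weight

hidden = torch.randn(2, 5, d_model)  # stand-in for the decoder output
logits = generator(hidden)           # (batch, seq_len, vocab_size)
print(logits.shape)
print(generator.weight is tgt_embedding.weight)
```

Because the three layers hold the same tensor, a gradient step on one of them updates all of them.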
Ok, let's continue to read:
- [...] In the embedding layers, we multiply those weights by $\sqrt{d_{model}}$ .
...Totally out of nowhere...why now???:weary:
After a very long search and time thinking about it..
The answer is that there is no answer! As also reported in HERE HERE HERE HERE
Actually my take on this revolves around a couple of thoughts:
- Inside the attention blocks all the dot-products are divided by $\sqrt{d_k}$ , which is the standard deviation of a dot-product between two independent random vectors with zero-mean, unit-variance components, so that after scaling everything has a variance of 1;
- The layer normalization, which is used extensively, is done exactly to keep every vector at a variance of 1;
- From the scheme it's possible to see that we always have the layer normalization as output of both encoder and decoder.
Hence, my idea is that since the actual vectors that represent the tokens as inputs of both encoder and decoder "don't have variance of 1" (I'm talking about the embeddings from the embedding layers), we need to rescale them, multiplying them back by $\sqrt{d_{model}}$ .
Any comment on this is more than welcome.
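Two small sketches related to these thoughts. The first empirically checks that the dot-product of two independent random vectors with unit-variance components has a standard deviation close to the square root of their dimension. The second shows one way the multiplication by $\sqrt{d_{model}}$ could be applied to the embeddings; the class name `ScaledEmbedding` is hypothetical and this is not necessarily how the repo does it.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# 1) Standard deviation of a dot-product between random vectors of dimension d
d = 512
q = torch.randn(10_000, d)
k = torch.randn(10_000, d)
dots = (q * k).sum(dim=-1)
raw_std = dots.std().item()
scaled_std = (dots / math.sqrt(d)).std().item()
print(raw_std, math.sqrt(d))  # the two numbers should be close
print(scaled_std)             # close to 1 after the scaling

# 2) Embedding layer whose output is multiplied by sqrt(d_model)
class ScaledEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.scale = math.sqrt(d_model)

    def forward(self, tokens):
        return self.embedding(tokens) * self.scale

layer = ScaledEmbedding(vocab_size=100, d_model=64)
tokens = torch.tensor([[3, 14, 15]])
out = layer(tokens)
print(out.shape)
```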
The only interesting thing that I'd like to report for this is that the normalization makes use of the Biased Variance and not the unbiased one (strengthening even more my idea on the rescaling by
Recall that:
So keep an eye on this if you want to reimplement this by yourself.
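A quick check of this against `torch.nn.LayerNorm`, comparing it with a manual normalization that uses the biased variance and one that uses the unbiased variance. The input is random and the sizes are illustrative; at initialization `LayerNorm` has weight 1 and bias 0, so it can be compared directly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 3, 8)
layer_norm = nn.LayerNorm(8)

mean = x.mean(dim=-1, keepdim=True)
var_biased = x.var(dim=-1, unbiased=False, keepdim=True)   # divides by N
var_unbiased = x.var(dim=-1, unbiased=True, keepdim=True)  # divides by N - 1

manual_biased = (x - mean) / torch.sqrt(var_biased + layer_norm.eps)
manual_unbiased = (x - mean) / torch.sqrt(var_unbiased + layer_norm.eps)

reference = layer_norm(x)
matches_biased = torch.allclose(reference, manual_biased, atol=1e-5)
matches_unbiased = torch.allclose(reference, manual_unbiased, atol=1e-5)
print(matches_biased, matches_unbiased)
```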
The article reports:
- Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position- wise fully connected feed-forward network.
- We apply dropout [ 33] to the output of each sub-layer, before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks.[...]
Hence, we deduce that the dropout layers are placed as depicted in the picture below:
What is not mentioned in the article is that the Dropout is also implemented inside the attention mechanism HERE HERE:
- After the Softmax function add a dropout
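A sketch of a scaled dot-product attention with the dropout placed right after the softmax. The class name, the additive-mask convention and the sizes are assumptions for this example.

```python
import torch
import torch.nn as nn

class ScaledDotProductAttention(nn.Module):
    def __init__(self, dropout_p=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, q, k, v, mask=None):
        scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
        if mask is not None:
            scores = scores + mask       # additive mask: 0 = keep, large negative = block
        weights = torch.softmax(scores, dim=-1)
        weights = self.dropout(weights)  # dropout on the attention weights
        return weights @ v

torch.manual_seed(0)
attention = ScaledDotProductAttention(dropout_p=0.1)
q = k = v = torch.randn(1, 4, 8)

attention.eval()   # dropout disabled: deterministic output
out_eval = attention(q, k, v)
attention.train()  # dropout active: some attention weights are zeroed, the rest rescaled
out_train = attention(q, k, v)
print(out_eval.shape, out_train.shape)
```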
Why do we need to use the special tokens? Around the web and in several papers a lot of different tokens are used.
Let's consider the inference time, so we are using our already trained model, and we want to translate a source sentence into a target sentence.
We already have an input sequence for the encoder, but how do we start the input of the decoder??
We need a starting point from which we can compute the whole sequence, which in theory should be the first word of the translation, which we do not know!
For this reason, a dummy word is enough, which we'll call [SOS] (Start Of Sentence).
Let's say
- $f_d^1([SOS], f_e([The, dog, is, beautiful])) = [Il]$
- $f_d^2([Il], f_e([The, dog, is, beautiful])) = [cane]$
- $f_d^3([cane], f_e([The, dog, is, beautiful])) = [è]$
- $f_d^4([è], f_e([The, dog, is, beautiful])) = [bello]$

The [EOS] token (End Of Sentence) is necessary for exactly the opposite reason of the start token. We need to stop the generation of words. Considering that the generation happens one token at a time, so practically in a for loop, we need a way to stop the generation, but also to allow the model to learn when to stop. For this reason we need the [EOS] to be set at the end of the sentence for the decoder.
- $f_d^4([è], f_e([The, dog, is, beautiful])) = [bello]$
- $f_d^5([bello], f_e([The, dog, is, beautiful])) = [come]$
- $f_d^6([come], f_e([The, dog, is, beautiful])) = [il]$
- $f_d^7([il], f_e([The, dog, is, beautiful])) = [tramonto]$
- ... it can continue gibbering..

The right way:
- $f_d^4([è], f_e([The, dog, is, beautiful])) = [bello]$
- $f_d^5([bello], f_e([The, dog, is, beautiful])) = [EOS]$
- STOP

In this way we know when to stop generating.
The encoder, at least in principle, needs neither the [SOS] nor the [EOS] token. However, these are often used in the encoder as well, mainly to help the model understand where the input sequence of the encoder starts and finishes; in this way they can influence the generation or the termination of the output sequence. HERE HERE
The padding is just added right after the [EOS].
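A tiny sketch of the resulting layout, working on words instead of token ids to keep it readable. The helper `add_special_tokens` is hypothetical.

```python
SOS, EOS, PAD = "[SOS]", "[EOS]", "[PAD]"

def add_special_tokens(words, max_len):
    """Wrap a sentence with [SOS]/[EOS], then pad on the right up to max_len."""
    sequence = [SOS] + words + [EOS]
    return sequence + [PAD] * (max_len - len(sequence))

padded = add_special_tokens(["Il", "cane", "è", "bello"], max_len=8)
print(padded)
# ['[SOS]', 'Il', 'cane', 'è', 'bello', '[EOS]', '[PAD]', '[PAD]']
```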
Now the crispy things! All the guides that I found were boring, redundant and somewhat unclear on the peculiarity of the transformer training, which in my opinion is based on only two things:
- Shift Left the ground-truth output by just one step;
- Set the CrossEntropyLoss to ignore the paddings!
In the paper this is depicted as "Output (Shifted right)", very confusing in my opinion.
Anyway, let's make an example: The ground truth output is
$out-rolled = [cane, è, bello, \text{PAD}, \text{PAD}, \text{PAD}, Il]$
Set the last as padding (in a moment you'll understand why):
$out-rolled = [cane, è, bello, \text{PAD}, \text{PAD}, \text{PAD}, \text{PAD}]$
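```python
# Shift every sequence one step to the left (the first token wraps around to the end)
target_batch_out = torch.roll(target_batch, -1, dims=-1)
# Overwrite the wrapped-around token with padding
target_batch_out[:, -1] = self.padding_index
```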
When we compute the loss we don't need to match the paddings, since they are just blank spaces. We need to compute it only for the meaningful tokens. Fortunately, the nn.CrossEntropyLoss(...) class has the ignore_index parameter that you can easily set.
```python
self.loss = nn.CrossEntropyLoss(ignore_index=self.padding_index)
```
Of course other faster implementations are possible.
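The two ingredients can be put together in a self-contained sketch. The token ids (`PAD = 0`, `SOS = 1`, `EOS = 2`), the vocabulary size and the random `logits` standing in for the model output are all assumptions for this example.

```python
import torch
import torch.nn as nn

PAD, SOS, EOS = 0, 1, 2  # hypothetical special token ids
vocab_size = 10

# Decoder input: [SOS] w1 w2 w3 [EOS] PAD PAD
target_batch = torch.tensor([[SOS, 5, 6, 7, EOS, PAD, PAD]])

# Ground truth for the loss: the same sequence shifted left by one step
target_batch_out = torch.roll(target_batch, -1, dims=-1)
target_batch_out[:, -1] = PAD  # the wrapped-around [SOS] becomes padding
print(target_batch_out)        # tensor([[5, 6, 7, 2, 0, 0, 0]])

torch.manual_seed(0)
logits = torch.randn(1, 7, vocab_size)  # stand-in for the model output

loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)
loss = loss_fn(logits.reshape(-1, vocab_size), target_batch_out.reshape(-1))

# Same value as averaging only over the 4 meaningful positions
loss_manual = nn.CrossEntropyLoss()(logits[0, :4], target_batch_out[0, :4])
print(loss.item(), loss_manual.item())
```

At position t the decoder sees the tokens up to t and is asked to predict token t+1, and the padded positions contribute nothing to the loss.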
As already mentioned many times, inference is done in an autoregressive way. This means that the output depends on all the previous values. However,
How come the output of the decoder is the same size as its input, even though we just need the next token?
Well for the first token is simple: the input will be
```python
out = model(encoder_input=tokenized_sentence,
            decoder_input=decoder_input)
# The output has size (batch_size, sequence_length, vocab_size)
# Take just the last position; softmax is monotonic, so argmax over the logits picks the same token
next_token = torch.argmax(out[:, -1, :], dim=-1, keepdim=True)  # (batch_size, 1)
decoder_input = torch.cat([decoder_input, next_token], dim=-1)
```
Well, remember the example of the causal mask, that I'll report here:
The output vector is what concerns us. The first component of the output vector considers only the first token, the second component the first two, the third one the first three, and so on. For this reason, we'll just take the last element.
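To see the whole loop run end to end, here is a sketch of greedy decoding with a toy stand-in for the trained model. `toy_model` is entirely hypothetical: it ignores the encoder input and deterministically "predicts" the next id, which is enough to show how the last position is taken and how [EOS] stops the loop.

```python
import torch
import torch.nn.functional as F

SOS, EOS = 1, 2  # hypothetical special token ids
vocab_size = 6

def toy_model(encoder_input, decoder_input):
    """Stand-in for a trained Transformer: returns logits of shape (batch, seq_len, vocab_size).
    After [SOS] it predicts 3, otherwise token + 1, and [EOS] once the vocabulary is exhausted."""
    next_ids = torch.where(decoder_input == SOS, torch.tensor(3), decoder_input + 1)
    next_ids = torch.where(next_ids >= vocab_size, torch.tensor(EOS), next_ids)
    return F.one_hot(next_ids, vocab_size).float()

encoder_input = torch.tensor([[4, 4, 4]])
decoder_input = torch.tensor([[SOS]])  # generation starts from [SOS] only
max_len = 10                           # safety limit in case [EOS] never comes

for _ in range(max_len):
    out = toy_model(encoder_input, decoder_input)                # (1, current_len, vocab_size)
    next_token = out[:, -1, :].argmax(dim=-1, keepdim=True)      # keep only the last position
    decoder_input = torch.cat([decoder_input, next_token], dim=-1)
    if next_token.item() == EOS:
        break

print(decoder_input)  # tensor([[1, 3, 4, 5, 2]])
```

The model output grows with the decoder input at every step, but only its last row is new information; all the previous rows repeat predictions that were already consumed.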