- BERT is powerful, and it was pre-trained for NLP downstream tasks, but BERT is very large and not very fast at processing data.
- The benefit of building a transformer encoder from scratch is that we can understand deeply how the mechanism works and customize BERT with fewer parameters.
- The code for the transformer encoder is available here.
- This notebook shows how to use the package. You can also modify the code for your own understanding.
- Note: there are 2 classes. The code of EncoderLayers and EncoderModel is the same, BUT (see the sketch after this list):
- EncoderLayers inherits from tf.keras.layers.Layer
- EncoderModel inherits from tf.keras.Model
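
A minimal sketch of how the two classes differ only in the base class they inherit from. The constructor arguments (d_model, num_heads) are illustrative assumptions, not the package's exact signature:

```python
import tensorflow as tf


class EncoderLayers(tf.keras.layers.Layer):
    """Encoder stack usable as a layer inside a larger tf.keras.Model."""

    def __init__(self, d_model=128, num_heads=4, **kwargs):  # illustrative args
        super().__init__(**kwargs)
        self.d_model = d_model
        self.num_heads = num_heads

    def call(self, x, training=False):
        # ... multi-head attention + feed-forward blocks go here ...
        return x


class EncoderModel(tf.keras.Model):
    """Same encoder stack, but exposed as a standalone model,
    so it gets compile()/fit()/save() for free."""

    def __init__(self, d_model=128, num_heads=4, **kwargs):  # illustrative args
        super().__init__(**kwargs)
        self.d_model = d_model
        self.num_heads = num_heads

    def call(self, x, training=False):
        # ... identical forward pass to EncoderLayers.call ...
        return x
```

Use EncoderLayers when the encoder is one piece of a bigger network; use EncoderModel when you want to train or save the encoder on its own.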
- I made the following graphs to show how to build your own encoder layer. You can follow the code along with these graphs.
(Good Luck! :P)
Note: in the graphs I forgot to include the tf.nn.softmax() step that turns the scaled attention scores into the attention weights.
"z" is the context matrix and it needs to be transposed and reshaped.