- BERT is powerful, and it was pre-trained for NLP downstream tasks, but BERT is very large and not very fast at processing data.
- The benefit of building a transformer encoder from scratch is that we can understand deeply how the mechanism works and customize BERT with fewer parameters.
- The code for the transformer encoder is available here.
- This notebook shows how to use the package. You can also modify the code for your own understanding.
- Note: there are 2 classes. The code of EncoderLayers and EncoderModel is the same, BUT (see the sketch after this list):
- EncoderLayers inherits from tf.keras.layers.Layer
- EncoderModel inherits from tf.keras.Model
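
A minimal sketch of how the two classes differ only in the base class they inherit from. The constructor arguments (d_model, num_heads) are illustrative assumptions, not the package's exact signature:

```python
import tensorflow as tf


class EncoderLayers(tf.keras.layers.Layer):
    """Encoder stack usable as a layer inside a larger tf.keras.Model."""

    def __init__(self, d_model=128, num_heads=4, **kwargs):  # illustrative args
        super().__init__(**kwargs)
        self.d_model = d_model
        self.num_heads = num_heads

    def call(self, x, training=False):
        # ... multi-head attention + feed-forward blocks go here ...
        return x


class EncoderModel(tf.keras.Model):
    """Same encoder stack, but exposed as a standalone model,
    so it gets compile()/fit()/save() for free."""

    def __init__(self, d_model=128, num_heads=4, **kwargs):  # illustrative args
        super().__init__(**kwargs)
        self.d_model = d_model
        self.num_heads = num_heads

    def call(self, x, training=False):
        # ... identical forward pass to EncoderLayers.call ...
        return x
```

Use EncoderLayers when the encoder is one piece of a bigger network; use EncoderModel when you want to train or save the encoder on its own.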
- I made the following graphs to show how to build your own encoder layer. You can follow the code along with these graphs.
(Good Luck! :P)
Note: in the graphs I forgot to include the tf.nn.softmax() step that turns the scaled attention scores into the attention weights.
"z" is the context matrix and it needs to be transposed and reshaped.