🚗 Transformer

Implementation of the architecture from Attention Is All You Need, the paper that introduced a novel approach to sequence processing tasks using attention mechanisms instead of the classic RNN-based approach. The authors achieved state-of-the-art results while significantly reducing training time and increasing network throughput.

We trained the model on two separate tasks: neural machine translation, the task from the original paper, and text summarization, the task of generating a short summary of a long input sequence (e.g. a review or an article).

ℹ️ About:

The code is written in Python and uses TensorFlow, a deep learning framework created by Google. We based our implementation on the model architecture from the paper, with a few minor tweaks to accommodate limited training resources. The architecture is described in the next section.

:shipit: Model architecture:

Instead of using a standard RNN-based approach, the Transformer uses attention mechanisms to process the input sequences and learn a semantic mapping between the source sentence (in our case English) and the target sentence (German). It can be viewed as a two-piece architecture consisting of an encoder and a decoder. Both share the same input processing: an embedding layer followed by positional encoding. Because we aren't using RNNs, which naturally encode sequential relations (and are slow precisely because they process tokens one at a time), we have to encode positions manually. This is done via sine-wave transformations that add an intrinsic temporal component to the data while allowing the whole sequence to be processed in parallel.
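For reference, here is a minimal sketch of the sinusoidal positional encoding described above. The function name and shapes are illustrative and not necessarily the exact code in this repo; the formula (sine on even dimensions, cosine on odd ones, with the 10000 base) follows the paper.

```python
import numpy as np
import tensorflow as tf

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding: sine on even dims, cosine on odd dims."""
    positions = np.arange(max_len)[:, np.newaxis]      # (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]           # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / np.float32(d_model))
    angles = positions * angle_rates                   # (max_len, d_model)
    angles[:, 0::2] = np.sin(angles[:, 0::2])          # even indices -> sine
    angles[:, 1::2] = np.cos(angles[:, 1::2])          # odd indices  -> cosine
    return tf.cast(angles[np.newaxis, ...], tf.float32)  # (1, max_len, d_model)

# The encoding is simply added to the token embeddings before the first block:
# x = embedding(tokens) + positional_encoding(max_len, d_model)[:, :seq_len, :]
```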

  • Encoder architecture:

    • The encoder consists of a variable number of Multi-Head Scaled Dot-Product Attention (MHDPA) blocks. Each block takes the position-encoded data, extracts a key, query, value tuple from it, encodes relations using scaled dot-product attention (a minimal sketch of this operation follows the list) and outputs a processed sequence.
  • Decoder architecture:

    • The decoder consists of a variable number of stacked pairs of Masked MHDPA and MHDPA blocks. The same type of processing is done here as in the encoder, the only difference being that the key, value pair fed into the MHDPA block comes from the encoder output, while the query is taken from the Masked MHDPA block.
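The core operation inside every MHDPA block is scaled dot-product attention. Below is a minimal sketch of it; the mask argument is how the decoder's Masked MHDPA blocks hide future positions. Names and shapes are illustrative rather than the exact code in this repo.

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    matmul_qk = tf.matmul(q, k, transpose_b=True)        # (..., seq_q, seq_k)
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_logits = matmul_qk / tf.math.sqrt(d_k)
    if mask is not None:
        # Masked positions (e.g. future tokens in the decoder) get a large
        # negative value so they vanish after the softmax.
        scaled_logits += mask * -1e9
    weights = tf.nn.softmax(scaled_logits, axis=-1)      # attention weights
    return tf.matmul(weights, v)                          # (..., seq_q, d_v)

# In the decoder's second attention block, k and v come from the encoder output
# while q comes from the preceding Masked MHDPA block.
```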

Outputs from the decoder pass through a linear layer with a softmax activation to produce the output sequence.
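As a rough illustration of that final projection (the sizes below are hypothetical, not the values used in this repo):

```python
import tensorflow as tf

d_model, vocab_size = 512, 32000          # hypothetical sizes for illustration

# Project each decoder position onto the target vocabulary, then normalize.
output_layer = tf.keras.layers.Dense(vocab_size)

decoder_output = tf.random.normal((1, 20, d_model))   # (batch, target_len, d_model)
logits = output_layer(decoder_output)                  # (batch, target_len, vocab_size)
probs = tf.nn.softmax(logits, axis=-1)                 # per-position token probabilities
```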

(Figure: model architecture diagram)

💻 Running the code:

Neural translation:

Get the dataset by running:

wget -qO- --show-progress https://wit3.fbk.eu/archive/2016-01//texts/de/en/de-en.tgz | tar xz; mv de-en data

Hyperparameters are set via flags passed to the train_translation.py script in the project root. There are reasonable defaults set, so you can just run python train_translation.py to start training.

Text summarization:

Get the dataset from Kaggle. Hyperparameters are set via flags passed to the train_summarization.py script in the project root. There are reasonable defaults set, so you can just run python train_summarization.py to start training.

NOTE:

Training can take a long time (24h+) even on a very powerful PC.

🎓 Authors: