This project aims to implement the Fastformer architecture as proposed in the paper Fastformer: Additive Attention Can Be All You Need. Fastformer is designed as a more efficient alternative to the traditional Transformer: it replaces the quadratic pairwise self-attention with additive attention, whose cost grows linearly with sequence length.
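To make the mechanism concrete, the single-head module below is a minimal sketch of Fastformer-style additive attention: queries are pooled into one global query via learned scalar scores, mixed element-wise into the keys, pooled again into a global key, and mixed into the values. The class and parameter names (`AdditiveAttention`, `q_score`, `k_score`) are illustrative, not the notebook's; the actual implementation may add multiple heads and weight sharing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Minimal single-head sketch of Fastformer-style additive attention.

    The sequence is summarized with learned scalar scores instead of pairwise
    query-key dot products, so the cost is linear in sequence length.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.q_score = nn.Linear(dim, 1)   # scalar attention score per query
        self.k_score = nn.Linear(dim, 1)   # scalar attention score per key
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)

        # Pool queries into a single global query vector (linear in seq_len).
        q_weights = F.softmax(self.q_score(q), dim=1)        # (B, L, 1)
        global_q = (q_weights * q).sum(dim=1, keepdim=True)  # (B, 1, D)

        # Mix the global query into each key, then pool into a global key.
        p = k * global_q                                     # (B, L, D)
        k_weights = F.softmax(self.k_score(p), dim=1)
        global_k = (k_weights * p).sum(dim=1, keepdim=True)  # (B, 1, D)

        # Mix the global key into each value; residual connection to the queries.
        u = v * global_k                                     # (B, L, D)
        return self.out(u) + q
```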
- Data Preparation: Preprocessing and tokenization of the AG_NEWS dataset using TorchText (see the data-pipeline sketch after this list).
- Model Architecture: Implementation of the Fastformer and traditional Transformer models.
- Training: Training loop, including hyperparameter settings and optimization routines.
- Evaluation: Performance metrics and comparisons between the Fastformer and the traditional Transformer.
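As a reference for the data-preparation step, the snippet below sketches a typical AG_NEWS pipeline using TorchText's tokenizer and vocabulary utilities. The exact calls depend on the TorchText version used in the notebook, and the helper names (`yield_tokens`, `text_pipeline`, `label_pipeline`) are illustrative.

```python
import torch
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer("basic_english")

def yield_tokens(data_iter):
    # AG_NEWS yields (label, text) pairs; tokenize only the text.
    for _, text in data_iter:
        yield tokenizer(text)

# Build the vocabulary from the training split, reserving an <unk> token.
vocab = build_vocab_from_iterator(yield_tokens(AG_NEWS(split="train")),
                                  specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

def text_pipeline(text: str) -> torch.Tensor:
    # Convert raw text into a tensor of token indices.
    return torch.tensor(vocab(tokenizer(text)), dtype=torch.long)

def label_pipeline(label: int) -> int:
    # AG_NEWS labels are 1..4; shift them to 0..3 for cross-entropy loss.
    return label - 1
```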
The notebook includes a comparison between the Fastformer and traditional Transformer models. Preliminary results suggest that Fastformer runs slightly faster, although epoch-to-epoch training times vary between runs.
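For context on how such per-epoch timings can be measured, the helper below is a rough sketch rather than the notebook's actual benchmarking code. It assumes a dataloader yielding padded `(texts, labels)` batches and two already-constructed models (hypothetical names `fastformer_model` and `transformer_model`).

```python
import time
import torch

def time_one_epoch(model, dataloader,
                   device="cuda" if torch.cuda.is_available() else "cpu"):
    # Rough wall-clock timing of a single training epoch (hypothetical helper).
    model = model.to(device).train()
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    start = time.perf_counter()
    for texts, labels in dataloader:
        texts, labels = texts.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(texts), labels)
        loss.backward()
        optimizer.step()
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    return time.perf_counter() - start

# Example: compare per-epoch time of the two models on the same dataloader.
# fast_time = time_one_epoch(fastformer_model, train_loader)
# base_time = time_one_epoch(transformer_model, train_loader)
# print(f"Fastformer: {fast_time:.1f}s  Transformer: {base_time:.1f}s")
```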