This project aims to deepen the understanding of Transformer architectures by implementing and experimenting with different components, improving model performance, and gaining insights into the nuances of speech segment classification and language modeling tasks.
- Part 1: Encoder Implementation and Classification
  - Implement a Transformer Encoder from scratch.
  - Train it jointly with a feedforward classifier to predict the US President who delivered a given speech segment.
  - Dataset: speech segments labeled with the politician who delivered them (Barack Obama, George W. Bush, George H. W. Bush).
  - Evaluate classifier accuracy.
- Part 2: Decoder Implementation and Language Modeling
  - Implement a Transformer Decoder with masked self-attention.
  - Pretrain the Decoder on an autoregressive language modeling task to predict the next word in a sequence.
  - Dataset: unlabeled text from the speeches.
  - Report perplexity on test sets from different politicians.
- Part 3: Architectural Exploration
  - Experiment with various transformer architecture components, such as positional encodings and sparse attention patterns.
  - Aim to improve the classifier's accuracy or the decoder's perplexity.
- Classification Task:
  - Accuracy on the test dataset.
  - Track accuracy across 15 epochs.
- Language Modeling Task:
  - Perplexity on the test sets for the different politicians.
  - Track perplexity every 100 iterations, up to 500 iterations.
- Document the implementation process, results, and insights gained.
- Include plots and visualizations of the attention matrices.
- Summarize performance improvements and architectural exploration findings.
It is recommended to run the code inside a virtual environment. You can create one using the following command:
python3 -m venv ./.venv
This will create a virtual environment called `.venv`. Activate the environment using the following command:
source ./.venv/bin/activate
You will need Python 3 with PyTorch, tqdm, pandas, and matplotlib installed to run the code (the `json` module used for the outputs is part of the Python standard library). Please install the dependencies using the following command:
python3 -m pip install -r requirements.txt
Once the dependencies are installed, please use the following commands to run different parts of the assignment:
# Classification
python3 main.py part1
# Language Modeling
python3 main.py part2
# Exploration
python3 main.py part3
If you want to see the metrics at each iteration, set `--verbose=True`. It is `False` by default. For example:
# Classification
python3 main.py part1 --verbose=True
If you want to perform a sanity check on the attention maps, set `--perform-sanity-check=True`. It is `False` by default. For example:
# Classification
python3 main.py part1 --perform-sanity-check=True
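For reference, a CLI like the one above can be handled with a plain `argparse` setup. The sketch below only illustrates how such flags might be parsed; the subcommand and flag names are taken from the usage shown here, but the parsing code itself is an assumption and may differ from what `main.py` actually does.

```python
# Hypothetical sketch of a CLI matching the usage above; not the project's actual main.py.
import argparse


def str2bool(value: str) -> bool:
    """Interpret strings such as 'True'/'False' as booleans."""
    return value.lower() in ("true", "1", "yes")


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Transformer assignment runner")
    parser.add_argument("part", choices=["part1", "part2", "part3"],
                        help="Which part of the assignment to run")
    parser.add_argument("--verbose", type=str2bool, default=False,
                        help="Print metrics at each iteration")
    parser.add_argument("--perform-sanity-check", type=str2bool, default=False,
                        help="Visualize attention maps as a sanity check")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(args.part, args.verbose, args.perform_sanity_check)
```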
Here, I implement a transformer encoder and train it from scratch, jointly with a feedforward classifier, on the downstream task of predicting which politician delivered a given speech segment. The steps involved are as follows (a code sketch of the main building blocks appears after this list):

- Load text from the `speechesdataset/train_CLS.tsv` and `speechesdataset/train_LM.txt` files using the `load_texts()` function.
- Build a tokenizer using the `SimpleTokenizer` class, which builds a vocabulary from the given text and encodes/decodes text into indices.
- Run the `classification_task()` function:
  - Get an iterable over the train dataset (`speechesdataset/train_CLS.tsv`) and the test dataset (`speechesdataset/test_CLS.tsv`) using the `get_cls_data_loader()` function, which relies on the `SpeechesClassificationDataset()` and `DataLoader()` classes.
  - Define the `classifier` object using the `transformer.Classifier` class. It consists of the following:
    - A transformer `Encoder`, which consists of 2 embedding layers followed by 4 transformer `Block`s.
      - Each `Block` consists of a `MultiHeadAttention` layer and a `FeedForward` layer, along with `LayerNorm` layers and residual connections.
      - Each `MultiHeadAttention` layer contains 2 `AttentionHead`s followed by a `Linear` layer. Each `AttentionHead` performs the attention operation using key (`k`), query (`q`), and value (`v`) vectors computed from the input (`x`).
      - The `FeedForward` layer consists of 2 `Linear` layers with a `ReLU` activation to introduce non-linearity.
    - A feedforward classifier head consisting of 2 `Linear` layers with a `ReLU` activation.
  - Define the `criterion` and the `optimizer`, and train and evaluate the `classifier` for `epochs_cls` epochs.
  - Save and output the `train_loss`, `train_accuracy`, and `test_accuracy` for each epoch.
  - Perform a sanity check on the attention maps using the `Utilities` class.
- Write the output to a JSON file using the `write_output_to_json()` function.
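For concreteness, the sketch below shows how the building blocks listed above (`AttentionHead`, `MultiHeadAttention`, `FeedForward`, `Block`) fit together in PyTorch. It is a minimal illustration under assumed hyperparameter names (`n_embd`, `head_size`, `n_head`); the actual classes in this repository may differ in their exact signatures.

```python
# Minimal sketch of the encoder building blocks; names/signatures are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionHead(nn.Module):
    """A single attention head: projects x to k, q, v and attends."""
    def __init__(self, n_embd: int, head_size: int):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

    def forward(self, x):
        k, q, v = self.key(x), self.query(x), self.value(x)      # (B, T, head_size)
        weights = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # scaled dot-product scores
        weights = F.softmax(weights, dim=-1)
        return weights @ v


class MultiHeadAttention(nn.Module):
    """Several AttentionHeads in parallel, followed by a Linear projection."""
    def __init__(self, n_embd: int, n_head: int = 2):
        super().__init__()
        head_size = n_embd // n_head
        self.heads = nn.ModuleList([AttentionHead(n_embd, head_size) for _ in range(n_head)])
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        return self.proj(torch.cat([h(x) for h in self.heads], dim=-1))


class FeedForward(nn.Module):
    """Two Linear layers with a ReLU in between."""
    def __init__(self, n_embd: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_embd, 4 * n_embd), nn.ReLU(),
                                 nn.Linear(4 * n_embd, n_embd))

    def forward(self, x):
        return self.net(x)


class Block(nn.Module):
    """Attention + feed-forward, each wrapped in a LayerNorm and a residual connection."""
    def __init__(self, n_embd: int, n_head: int = 2):
        super().__init__()
        self.sa = MultiHeadAttention(n_embd, n_head)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
```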
Here, I implement a word-level, GPT-like transformer decoder, pretrain it on an autoregressive language modeling task, and report perplexity numbers on speeches from different politicians. The steps involved are as follows (a sketch of the causal masking and the perplexity computation appears after this list):

- Load text from the `speechesdataset/train_CLS.tsv` and `speechesdataset/train_LM.txt` files using the `load_texts()` function.
- Build a tokenizer using the `SimpleTokenizer` class, which builds a vocabulary from the given text and encodes/decodes text into indices.
- Run the `language_modeling_task()` function:
  - Get an iterable over the train dataset (`speechesdataset/train_LM.txt`) and the test datasets (`speechesdataset/test_LM_hbush.txt`, `speechesdataset/test_LM_obama.txt`, and `speechesdataset/test_LM_wbush.txt`) using the `get_lm_data_loader()` function, which relies on the `LanguageModelingDataset()` and `DataLoader()` classes.
  - Define the `decoder` object using the `transformer.Decoder` class. It consists of the following:
    - A transformer `Decoder`, which consists of 2 embedding layers, 4 transformer `Block`s, a final `LayerNorm` layer, and a `Linear` layer.
      - Each `Block` consists of a `MultiHeadAttention` layer and a `FeedForward` layer, along with `LayerNorm` layers and residual connections.
      - Each `MultiHeadAttention` layer contains 2 `AttentionHead`s followed by a `Linear` layer. Each `AttentionHead` performs the (masked) attention operation using key (`k`), query (`q`), and value (`v`) vectors computed from the input (`x`).
      - The `FeedForward` layer consists of 2 `Linear` layers with a `ReLU` activation to introduce non-linearity.
  - Define the `optimizer` and train the `decoder` for `max_iters` iterations, evaluating after every `eval_interval` iterations.
  - Save and output the `train_perplexity`, `hbush_test_perplexity`, `obama_test_perplexity`, and `wbush_test_perplexity` at each interval.
  - Perform a sanity check on the attention maps using the `Utilities` class.
- Write the output to a JSON file using the `write_output_to_json()` function.
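Two details worth spelling out for the decoder are the causal mask applied inside each `AttentionHead` and how perplexity is derived from the language-modeling loss. The sketch below illustrates both; the decoder output shape and the data-loader interface assumed here are illustrations and may not match the project code exactly.

```python
# Sketch of causal (masked) attention weights and of the reported perplexity.
# Interfaces (decoder(x) -> logits, loader yielding (input, target) pairs) are assumptions.
import math
import torch
import torch.nn.functional as F


def masked_attention_weights(q, k):
    """Scaled dot-product weights with a causal mask (future positions hidden)."""
    T = q.shape[-2]
    weights = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5          # (B, T, T)
    tril = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    weights = weights.masked_fill(~tril, float("-inf"))              # hide future tokens
    return F.softmax(weights, dim=-1)


@torch.no_grad()
def perplexity(decoder, data_loader, device="cpu"):
    """Perplexity = exp(mean cross-entropy over all predicted tokens)."""
    total_loss, total_tokens = 0.0, 0
    for x, y in data_loader:                       # x: inputs, y: next-token targets (assumed)
        x, y = x.to(device), y.to(device)
        logits = decoder(x)                        # assumed shape (B, T, vocab_size)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1),
                               reduction="sum")
        total_loss += loss.item()
        total_tokens += y.numel()
    return math.exp(total_loss / total_tokens)
```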
This involves running the `classification_task()` and `language_modeling_task()` functions with the `use_alibi` argument set to `True`. The steps are the same as in Parts 1 and 2; the only difference is that the transformer now uses ALiBi positional embeddings instead of absolute positional embeddings. Concretely (see the sketch after this list):

- The `Classifier` and `Block` classes are unchanged.
- The `Encoder` and the `Decoder` each have only 1 embedding layer (instead of 2, as in Parts 1 and 2), since position is no longer encoded with an embedding.
- The `MultiHeadAttention` class has a new parameter called `m`, which has a different constant value (a power of 2) for each attention head.
- The `AttentionHead` class now implements ALiBi: it adds a `bias` matrix to the attention weights to encode the position of the key (`k`) vectors relative to the position of the query (`q`) vector. The value (`v`) vectors do not encode position information.
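The sketch below illustrates the ALiBi bias described above: a distance-based penalty with per-head slope `m` added directly to the attention scores, so no positional embedding is needed. The function name, shapes, and example slopes are illustrative, not this repository's exact API.

```python
# Sketch of ALiBi-biased attention weights; names and shapes are assumptions.
import torch
import torch.nn.functional as F


def alibi_attention_weights(q, k, m: float, causal: bool = False):
    """Attention weights with an ALiBi bias of slope m added to the scores."""
    T = q.shape[-2]
    scores = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5           # (B, T, T)
    # bias[i, j] = -m * |i - j|: keys far from the query are penalized linearly
    pos = torch.arange(T, device=q.device)
    bias = -m * (pos[None, :] - pos[:, None]).abs().float()
    scores = scores + bias
    if causal:  # decoder: additionally hide future positions
        tril = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
        scores = scores.masked_fill(~tril, float("-inf"))
    return F.softmax(scores, dim=-1)


# Example per-head slopes for 2 heads: a geometric sequence of powers of 2.
slopes = [2 ** -4, 2 ** -8]
```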
This involves running the `classification_task()` and `language_modeling_task()` functions with the `use_init_weights` argument set to `True`. The steps are the same as in Parts 1 and 2; the only difference is that the transformer now initializes its weights by sampling from a normal distribution with a mean of `0` and a standard deviation of `0.05`. This differs from Parts 1 and 2, where the weights use the default random initialization.
The outputs are written to JSON files. Each file contains the following fields (an illustrative example follows this list):

- `task`: The task, either classification or language modeling.
- `num_params`: The number of model parameters.
- `use_alibi`: Whether the model uses absolute positional embeddings or ALiBi positional embeddings.
- `use_init_weights`: Whether the model weights use the default random initialization or are sampled from a normal distribution.
- `history`: The model history, which includes the train and test metrics for each training epoch (or evaluation interval).
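For illustration, a classification output file might be shaped roughly as follows. The exact `history` layout is an assumption based on the metrics listed in Parts 1 and 2, and the real files may differ:

```python
# Illustrative shape of one output file (classification task); layout is assumed.
example_output = {
    "task": "classification",
    "num_params": "<model parameter count>",
    "use_alibi": False,
    "use_init_weights": False,
    "history": [
        # one entry per epoch, e.g.:
        # {"epoch": 1, "train_loss": ..., "train_accuracy": ..., "test_accuracy": ...},
    ],
}
```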
Following are the JSON files generated:

- Part 1: `part1_classification_task.json`
- Part 2: `part2_language_modeling_task.json`
- Part 3:
  - `part3_architectural_exploration_classification_task.json`
  - `part3_architectural_exploration_language_modeling_task.json`
  - `part3_performance_improvement_classification_task.json`
  - `part3_performance_improvement_language_modeling_task.json`
To regenerate the tables, plots, and other visualizations used in the Project Report, please refer to the `visualize.ipynb` Jupyter notebook.
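In the same spirit, a single metric can be read back from one of the JSON files and plotted with matplotlib, as in the hedged example below. The `history` entry layout assumed here (`test_accuracy` per epoch) is an illustration and may not match the actual files exactly.

```python
# Minimal example: load a JSON output and plot one metric; history layout is assumed.
import json
import matplotlib.pyplot as plt

with open("part1_classification_task.json") as f:
    output = json.load(f)

test_accuracy = [entry["test_accuracy"] for entry in output["history"]]
plt.plot(range(1, len(test_accuracy) + 1), test_accuracy, marker="o")
plt.xlabel("Epoch")
plt.ylabel("Test accuracy")
plt.title("Part 1: classification accuracy per epoch")
plt.show()
```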
- Transformer Architecture: https://github.com/karpathy/ng-video-lecture/blob/master/gpt.py
- ALiBi Positional Embeddings: https://github.com/ofirpress/attention_with_linear_biases/blob/master/fairseq/models/transformer.py