# EsperBERTo

A test of the Attention Is Off By One hypothesis with RoBERTa and Esperanto.
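In short, the hypothesis is that adding 1 to the softmax denominator lets an attention head assign (near-)zero total weight when it has nothing useful to attend to, which should tame outlier weights and activations. A minimal PyTorch sketch of softmax1, for illustration only (the implementation actually used in this repo may differ):

```python
import torch

def softmax_1(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """softmax1(x)_i = exp(x_i) / (1 + sum_j exp(x_j)).

    Unlike the standard softmax, the outputs can sum to less than 1,
    so a head can effectively abstain from attending.
    """
    # Shift by max(x, 0) for numerical stability; the implicit "+1" in the
    # denominator becomes exp(-shift) after the same shift is applied.
    shift = torch.amax(x, dim=dim, keepdim=True).clamp(min=0)
    exp_x = torch.exp(x - shift)
    return exp_x / (torch.exp(-shift) + exp_x.sum(dim=dim, keepdim=True))
```

The standard softmax is recovered by dropping the `torch.exp(-shift)` term, which forces the attention weights to sum to exactly 1.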

## Dataset

The dataset consists of the Esperanto portion of the OSCAR corpus from INRIA, which is derived from Common Crawl, together with the Esperanto sub-corpus of the Leipzig Corpora Collection. In particular, along with OSCAR, I use the following `epo_*-sentences.txt` files from the Leipzig Corpora:

| Dataset | Year | # of Sentences |
| --- | --- | --- |
| OSCAR | 2020 | 974k |
| LCC - Literature | 2011 | 300k |
| LCC - Mixed | 2012 | 1M |
| LCC - Newscrawl | 2017 | 1M |
| LCC - Web | 2012 | 1M |
| LCC - Wikipedia | 2021 | 300k |
| Total | - | 4.57M |

The dataset is 473 MB and is available on Hugging Face at chriswmurphy/esperanto.
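If you just want the data, it can be pulled from the Hub with the datasets library; a usage sketch (split names and columns are whatever the Hub repo defines):

```python
from datasets import load_dataset

# Download the combined OSCAR + Leipzig Esperanto corpus from the Hugging Face Hub.
dataset = load_dataset("chriswmurphy/esperanto")

print(dataset)               # available splits and row counts
print(dataset["train"][0])   # inspect one example (assumes a "train" split exists)
```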

## Model

The models are RoBERTa with 84M parameters. The first model is a baseline that uses the default softmax in its Attention mechanism; the second is a challenger that instead uses the proposed softmax1. Both models are also available on Hugging Face at chriswmurphy/esperberto-softmax0 and chriswmurphy/esperberto-softmax1. The idea to use RoBERTa with Esperanto came from this blog post.
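Loading the published checkpoints for masked-language-model inference is standard transformers usage; a sketch with the baseline model (the softmax1 variant may additionally need the custom attention code from this repo):

```python
from transformers import pipeline

# Fill-mask inference with the baseline (softmax0) checkpoint.
fill_mask = pipeline("fill-mask", model="chriswmurphy/esperberto-softmax0")

# "La suno brilas en la <mask>." = "The sun shines in the <mask>.";
# <mask> is RoBERTa's default mask token.
for prediction in fill_mask("La suno brilas en la <mask>."):
    print(prediction["token_str"], prediction["score"])
```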

## Results

### Training

As expected, softmax1 does not impact model performance at single precision.

| Model | Loss | Runtime | Cost |
| --- | --- | --- | --- |
| EsperBERTo w/ softmax0 | 4.46 | 9h 16m | $11.22 |
| EsperBERTo w/ softmax1 | 4.44 | 9h 16m | $11.24 |

### Kurtosis - Weights

Here we report the average excess kurtosis in the Attention output weights from our initial run. The weights in the dense Attention layers are Gaussian to a good approximation.

| Model | Dense Weight | Dense Bias | LayerNorm Weight | LayerNorm Bias |
| --- | --- | --- | --- | --- |
| EsperBERTo w/ softmax0 | $0.031 \pm 0.021$ | $0.6 \pm 1.1$ | $6.8 \pm 3.9$ | $0.5 \pm 0.8$ |
| EsperBERTo w/ softmax1 | $0.040 \pm 0.027$ | $1.4 \pm 1.4$ | $4.4 \pm 2.9$ | $1.9 \pm 2.3$ |
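For reference, numbers of this kind can be reproduced roughly as follows (a sketch, not the exact script used here; the parameter-name filter assumes transformers' RoBERTa module naming):

```python
from scipy.stats import kurtosis
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("chriswmurphy/esperberto-softmax0")

# Excess kurtosis (Fisher definition, so a Gaussian tensor scores ~0) of each
# attention-output dense weight matrix.
results = {
    name: float(kurtosis(param.detach().flatten().numpy()))
    for name, param in model.named_parameters()
    if name.endswith("attention.output.dense.weight")
}
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```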

### Kurtosis - Activations

Finally, we report the average excess kurtosis in the Attention output activations from our initial run. Once again, there is no meaningful difference between the softmax0 and softmax1 models, and the kurtosis of the Attention output activations is consistent with a Gaussian distribution.

| Model | Dense | Dropout | Output (LayerNorm) |
| --- | --- | --- | --- |
| EsperBERTo w/ softmax0 | $0.74 \pm 0.29$ | $1.15 \pm 0.32$ | $0.77 \pm 1.35$ |
| EsperBERTo w/ softmax1 | $1.96 \pm 1.58$ | $2.51 \pm 1.76$ | $0.90 \pm 1.31$ |
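Activation statistics like these can be gathered with forward hooks; a sketch of the idea (not the exact instrumentation behind the table):

```python
import torch
from scipy.stats import kurtosis

def attention_activation_kurtosis(model, batch):
    """Excess kurtosis of each attention-output dense activation for one batch."""
    captured = {}

    def make_hook(name):
        def hook(module, inputs, output):
            captured[name] = float(kurtosis(output.detach().flatten().cpu().numpy()))
        return hook

    # Hook the attention-output dense layers (name pattern assumes RoBERTa's layout).
    handles = [
        module.register_forward_hook(make_hook(name))
        for name, module in model.named_modules()
        if name.endswith("attention.output.dense")
    ]
    try:
        with torch.no_grad():
            model(**batch)  # batch: tokenizer output, e.g. input_ids / attention_mask
    finally:
        for handle in handles:
            handle.remove()
    return captured
```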

## Running

I'm running on an AWS g5.2xlarge EC2 instance with 1x Nvidia A10G GPU.

You can reproduce my results with `runner.sh`, e.g. by launching it in a screen session: `screen -S run "bash runner.sh"`. Don't forget to add your Hugging Face token and username to the env vars before running, e.g.

echo "HUGGINGFACE_TOKEN=<mySecretTokenVariable>" > .env
echo "HUGGINGFACE_USER=<myHFUserName>" >> .env

To test before running in earnest, add the `--test_pipeline` flag, e.g. `python train_model.py --test_pipeline`. To run the unit tests, use `pytest tests`.