Tiny Transformers

2024/Q4 Bachelor Research Project: Architectural Decisions for Language Modelling with (Small) Transformers, where we look at sample-efficient pre-training techniques at a small scale.

Posters and Papers

For context on the five transformer components studied, see the Project Description. One-sentence synopsis: we train small BERT/GPT models (~10M parameters) on a collection of short children's stories, and compare which techniques yield better performance on the GLUE/BLiMP benchmarks.

  • (Adaptive) Activation Functions – Filip Ignijić
  • Attention Heads & Dimensions – Khalit Gulamov
  • Infini-Attention at Its Limits – Lauri Kesküll
  • Sample-Efficient Tokenisation – Rafael Borges
  • Sparsity at a Fixed Model Size – Eugene Wu

Links

Guides

See guides for tutorials on remote development and implementation details.

  • Remote Development
  • Practical Information

Resources

See code/common for shared resources on tooling and training.

  • Tooling
  • Training

Architectural Decisions for Language Modelling with (Small) Transformers

Description from ProjectForum

Prerequisites

  • Motivation to learn (i.e. read papers) about state-of-the-art natural-language modelling techniques.
  • Strong Python programming skills.
  • Comfortable with shell environments (for submitting jobs to the supercomputer).
  • You will need to remember some of your Linear Algebra and Calculus courses (matrices and gradient descent).
  • Knowledge of deep-learning libraries such as transformers, PyTorch, or TensorFlow is a big plus!

Introduction

Language models (LMs) based on the transformer architecture [1] have, well, transformed the language-processing domain. State-of-the-art large LMs are exposed to several orders of magnitude more data than a human, yet are certainly not leveraging all of this information. Given the current trend of exponentially increasing model sizes, we are predicted to run out of the high-quality textual data that LMs tend to perform best on by 2026 [2].

This project studies the effect of architectural decisions on small LMs, with the overarching goal of increasing their sample efficiency and minimising their parameter count. Recent studies [3, 4] have shown that smaller models (≤33M parameters) can exhibit language understanding equivalent to that of their larger counterparts, and similarly display the desired emergent properties (e.g. reasoning and creativity) that drive their prevalence. This makes them ideal for exploring architectures in a compute-limited setting, and their small scale makes individual components more interpretable [3]. Additionally, small LMs allow local deployment, leading to better support for privacy, personalisation, and democratisation.

Current research lacks a precise understanding of how architectural decisions influence natural-language understanding and fine-tuned task performance in small LMs. While Eldan & Li [3] show that small LMs can exhibit linguistic understanding greater than GPT-2 (125M parameters) by reducing the breadth of their training data, their work lacks quantitative evaluation in a downstream, applied setting. Warstadt et al. [4] find that architectural optimisations are the most promising research direction for small LMs, but these decisions are rarely surveyed across different models. Moreover, research that applies LMs to downstream tasks rarely considers hyperparameters beyond the defaults, let alone architectural decisions [5].

Introducing RQ Terms

This project studies small LMs, particularly the impact of architectural decisions in the transformer blocks that perform the bulk of information integration. A block takes as input a sequence of embedded tokens, each represented by a vector whose length is the model's hidden size. These inputs are transformed by the self-attention module, followed by a feed-forward network (FFN). The attention mechanism [6] itself consists of a number of heads, each of which matches tokens' queries against other tokens' keys, to integrate information from the latter into the former. The FFNs (multi-layer perceptrons) allow the model to learn more complex, non-linear patterns. While this is the motivation for their introduction, it is worth noting that the exact way these two modules combine to produce complex generated text is still not fully understood.
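To make these terms concrete, below is a minimal, self-contained sketch of a single pre-norm transformer block in PyTorch. It is purely illustrative rather than the project's implementation; the hidden and FFN sizes mirror the base model described under Approach, while the 8-head split is an assumed value.

```python
# Illustrative single pre-norm transformer block (a sketch, not the project code).
# Hidden and FFN sizes follow the base model under Approach; 8 heads is an assumed
# value, and the causal mask used by GPT-style decoders is omitted for brevity.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, hidden: int = 512, ffn: int = 1024, heads: int = 8):
        super().__init__()
        self.ln1 = nn.LayerNorm(hidden)
        # Each head matches tokens' queries against other tokens' keys to mix information.
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(hidden)
        # The FFN (a two-layer MLP) applies a non-linear, per-token transformation.
        self.ffn = nn.Sequential(nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around attention
        return x + self.ffn(self.ln2(x))                    # residual around the FFN

x = torch.randn(2, 16, 512)   # (batch, sequence length, hidden size)
print(Block()(x).shape)       # torch.Size([2, 16, 512])
```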

Goal

We explore small transformer architectures to give context on design decisions and their impact on language understanding in downstream tasks.

Research Questions for the Sub-Projects

Following are some preliminary research questions based on the small LM exploration by Eldan & Li [3], though students are encouraged to construct/extend their own within the same domain. The models considered for each RQ are GPT-Neo and BERT.

  1. How is model performance affected by the width of hidden layers? [3]
  2. How is model performance affected by the depth of its FFNs? [3]
  3. How is model performance affected by the number of attention heads? [3, 7]
  4. How is model performance affected by the number of transformer blocks, and the ordering of modules within them? [3, 8]
  5. How is model performance affected by the width of Query-Key vectors? [6]

Additionally, each student is expected to perform an interpretability study akin to [3] on the specific component of their RQ.
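As an illustration of what such an interpretability study might involve (following the attention-pattern analyses in [3]), the sketch below inspects the attention of a small GPT-Neo model via the transformers library. The checkpoint and tokenizer names are assumptions taken from the public TinyStories releases, not the project's required models; substitute your own trained checkpoint.

```python
# Hypothetical probe of attention patterns; the checkpoint and tokenizer names are
# assumptions (public TinyStories releases), not the project's required models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-1M",
                                             output_attentions=True)

text = "Once upon a time, Lily found a shiny red ball."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions holds one tensor per block, shaped (batch, heads, seq, seq).
tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0])
for h, head in enumerate(out.attentions[-1][0]):   # heads of the last block
    src = head[-1].argmax().item()                 # where the final token looks
    print(f"head {h}: final token attends most to {tokens[src]!r}")
```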

Approach

  • Base Model: GPT-Neo & BERT at 9M parameters (4 blocks, 512 hidden size, 1024 intermediate FFN size, and 10k token vocabulary). Students augment this base model along the dimension of their RQ; a configuration sketch follows this list.
  • Dataset: TinyStories, consisting of ~2M short stories written by GPT-3.5 and GPT-4.
  • Evaluation: the BabyLM evaluation pipeline, consisting of the BLiMP grammar test suite and fine-tuning on SuperGLUE tasks to evaluate downstream performance. We further study the total training time and the total number of training samples each model sees, to draw conclusions about sample efficiency. Lastly, we study the resulting models' size and inference speed to evaluate them in an edge-device context.
  • Hardware: Due to their small size, the models can be pre-trained locally on students' laptops. Students are also able to train on the DelftBlue cluster.
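For orientation only, the stated base dimensions map onto a Hugging Face GPTNeoConfig roughly as sketched below; values not given above (head count, context length, attention pattern) are assumptions, not the project's exact settings, and the printed total includes embedding parameters. The RQ knobs correspond to hidden_size (RQ1), the FFN/intermediate_size (RQ2), num_heads (RQ3), and num_layers plus block ordering (RQ4); the query-key width of RQ5 would require a custom attention module.

```python
# Minimal sketch of a GPT-Neo configuration at the stated dimensions; head count,
# context length, and attention pattern are assumptions rather than project settings.
from transformers import GPTNeoConfig, GPTNeoForCausalLM

config = GPTNeoConfig(
    vocab_size=10_000,                  # 10k-token vocabulary
    hidden_size=512,                    # RQ1: width of hidden layers
    intermediate_size=1024,             # RQ2 varies the FFN
    num_layers=4,                       # RQ4: number of transformer blocks
    num_heads=8,                        # RQ3: number of attention heads (assumed)
    attention_types=[[["global"], 4]],  # one attention pattern per block (assumed)
    max_position_embeddings=512,        # context length (assumed)
)
model = GPTNeoForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```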

Q&A Sessions


Background

The increasingly large scale of LLMs stimulates ever more research into optimisation. One strand takes the models as they are and optimises them in a top-down fashion. Another strand, perhaps more interestingly, takes a bottom-up approach: experimenting with design decisions before and during pre-training. We further motivate the need for architectural optimisation by noting that such bottom-up advances are relatively unexplored in applied settings, compared to top-down approaches like prompt engineering. The studies by Eldan & Li [3] and Warstadt et al. [4] stimulate research in this area by highlighting the competitive performance of small LMs in resource-constrained settings.

TinyStories [3] is a synthetic dataset of ~2M short children's stories generated by GPT-3.5 and GPT-4. By limiting the breadth of the dataset in this manner, it can be used to train and evaluate LMs with ≤33M parameters; and these small LMs generate more coherent and grammatically correct text than a 125M-parameter GPT-2 model. Additionally, these smaller models are more interpretable, and reveal individual functions for specific neurons (e.g. attending to the protagonist in a story).
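The dataset is published on the Hugging Face Hub; a minimal loading sketch is shown below, assuming the roneneldan/TinyStories dataset ID from the original authors.

```python
# Minimal loading sketch; the dataset ID is the authors' public Hugging Face release,
# assumed here rather than prescribed by the project.
from datasets import load_dataset

stories = load_dataset("roneneldan/TinyStories", split="train")
print(len(stories))              # on the order of 2M stories
print(stories[0]["text"][:200])  # peek at the first story
```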

BabyLM [4] is a communal challenge in which participants competed to optimise training on a fixed data budget. The authors provided two budgets: datasets of 10M and 100M words, similar in both content and amount to the language a 12-year-old is exposed to. This has led to some fruitful developments in curriculum learning (gradually increasing the complexity of training tasks), knowledge distillation (training a small student on the outputs of a larger teacher model), and architecture optimisation.
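To make the distillation idea concrete, one common formulation (illustrative here, not a specific BabyLM entry) blends the usual cross-entropy loss with a softened KL-divergence between the student's and teacher's logits:

```python
# Generic distillation-loss sketch (illustrative; not tied to any BabyLM submission).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend cross-entropy on the labels with a temperature-softened KL term."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T
    return alpha * ce + (1 - alpha) * kl

# Toy usage: 4 examples over a 10-token vocabulary.
s, t = torch.randn(4, 10), torch.randn(4, 10)
print(distillation_loss(s, t, torch.randint(0, 10, (4,))))
```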

Recommended Material for Enthusiastic Students

References

  1. A. Vaswani et al., ‘Attention Is All You Need’. arXiv, Dec. 06, 2017. Accessed: Jan. 25, 2024. [Online]. Available: http://arxiv.org/abs/1706.03762v5
  2. P. Villalobos, J. Sevilla, L. Heim, T. Besiroglu, M. Hobbhahn, and A. Ho, ‘Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning’. arXiv, Oct. 25, 2022. Accessed: Jan. 23, 2024. [Online]. Available: http://arxiv.org/abs/2211.04325
  3. R. Eldan and Y. Li, ‘TinyStories: How Small Can Language Models Be and Still Speak Coherent English?’ arXiv, May 24, 2023. Accessed: Nov. 01, 2023. [Online]. Available: http://arxiv.org/abs/2305.07759
  4. A. Warstadt et al., ‘Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora’, in Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, Singapore: Association for Computational Linguistics, 2023, pp. 1–6. doi: 10.18653/v1/2023.conll-babylm.1.
  5. A. Wettig, T. Gao, Z. Zhong, and D. Chen, ‘Should You Mask 15% in Masked Language Modeling?’ arXiv, Feb. 10, 2023. Accessed: Jan. 24, 2024. [Online]. Available: http://arxiv.org/abs/2202.08005
  6. D. Bahdanau, K. Cho, and Y. Bengio, ‘Neural Machine Translation by Jointly Learning to Align and Translate’. arXiv, May 19, 2016. Accessed: Jan. 25, 2024. [Online]. Available: http://arxiv.org/abs/1409.0473
  7. P. Michel, O. Levy, and G. Neubig, ‘Are Sixteen Heads Really Better than One?’ arXiv, Nov. 04, 2019. Accessed: Jan. 21, 2024. [Online]. Available: http://arxiv.org/abs/1905.10650
  8. S. Shleifer, J. Weston, and M. Ott, ‘NormFormer: Improved Transformer Pretraining with Extra Normalization’. arXiv, Nov. 01, 2021. Accessed: Jan. 25, 2024. [Online]. Available: http://arxiv.org/abs/2110.09456

Related Reading

  • J. Hoffmann et al., ‘Training Compute-Optimal Large Language Models’. arXiv, Mar. 29, 2022. Accessed: Jan. 25, 2024. [Online]. Available: http://arxiv.org/abs/2203.15556

  • S. Gunasekar et al., ‘Textbooks Are All You Need’. arXiv, Oct. 02, 2023. Accessed: Jan. 23, 2024. [Online]. Available: http://arxiv.org/abs/2306.11644

  • L. G. G. Charpentier and D. Samuel, ‘Not all layers are equally as important: Every Layer Counts BERT’. arXiv, Nov. 07, 2023. Accessed: Jan. 25, 2024. [Online]. Available: http://arxiv.org/abs/2311.02265

  • R. D. Martinez et al., ‘CLIMB – Curriculum Learning for Infant-inspired Model Building’, in Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, Singapore: Association for Computational Linguistics, 2023, pp. 84–99. doi: 10.18653/v1/2023.conll-babylm.10.

  • V. Sanh, L. Debut, J. Chaumond, and T. Wolf, ‘DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter’, Oct. 2019, doi: 10.48550/arXiv.1910.01108.

  • J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, ‘BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding’. arXiv, May 24, 2019. Accessed: Jan. 25, 2024. [Online]. Available: http://arxiv.org/abs/1810.04805

  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, ‘GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding’. arXiv, Feb. 22, 2019. Accessed: Jan. 25, 2024. [Online]. Available: http://arxiv.org/abs/1804.07461

  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, ‘SQuAD: 100,000+ Questions for Machine Comprehension of Text’, in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas: Association for Computational Linguistics, 2016, pp. 2383–2392. doi: 10.18653/v1/D16-1264.
