
BERT Language Models



Training a transformer-based model for a specific language

Keywords: BERT

Approaches: Transformers

Tools: Tensor Processing Unit (TPU), Hugging Face Transformers, PyTorch


BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based machine learning technique for natural language processing (NLP), developed by Google (2018-). BERT is an unsupervised, bidirectional language representation, pre-trained on large plain-text corpora; being bidirectional, it takes into account the context of each occurrence of a given word.

Goals

BERT language models can be applied to natural language understanding tasks. The typical end use case of these libraries is fine-tuning a pre-trained language model on a specific downstream task such as summarizing large documents, text classification, or named entity recognition.
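
The following is a minimal sketch (not the working group's own code) of what fine-tuning a pre-trained BERT checkpoint for text classification looks like with Hugging Face Transformers and PyTorch. The checkpoint name, texts and labels are placeholders for illustration only.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Any pretrained BERT checkpoint can be used here (placeholder choice).
model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy training batch; in practice this comes from a labelled corpus.
texts = ["A national library digitises newspapers.", "The weather is nice today."]
labels = torch.tensor([1, 0])  # hypothetical binary labels

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
outputs = model(**batch, labels=labels)  # forward pass returns the loss
outputs.loss.backward()                  # backpropagate
optimizer.step()                         # a single optimisation step
```

In practice the same pattern is repeated over many batches and epochs, usually via the Trainer API or a custom training loop.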

Educational resources

National Library of Norway

The National Library of Norway released the NB-BERT-Base model. It is based on the same architecture as the multilingual cased BERT model and is trained on a wide variety of Norwegian text (both Bokmål and Nynorsk) from the last 200 years. The model has been evaluated on NER and POS tasks, and it is expected to perform well on other tasks too.
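
A minimal sketch of loading the model from the Hugging Face Hub and extracting contextual embeddings is shown below. The identifier "NbAiLab/nb-bert-base" is an assumption about the published name; check the Hub for the exact id.

```python
from transformers import AutoTokenizer, AutoModel

# Assumed Hub identifier for NB-BERT-Base.
tokenizer = AutoTokenizer.from_pretrained("NbAiLab/nb-bert-base")
model = AutoModel.from_pretrained("NbAiLab/nb-bert-base")

inputs = tokenizer("Nasjonalbiblioteket ligger i Oslo.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden size)
```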

National Library of Sweden

The National Library of Sweden / KBLab released three pretrained language models based on BERT and ALBERT. The models are trained on approximately 15-20 GB of text (200M sentences, 3,000M tokens) from various sources (books, news, government publications, Swedish Wikipedia and internet forums), aiming to provide a representative BERT model for Swedish text.

A complete description is available in this paper: "Playing with words at the National Library of Sweden - Making a Swedish BERT".
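
As a quick illustration, the Swedish model can be probed with a fill-mask query through the Transformers pipeline API. This is a hedged sketch: the identifier "KB/bert-base-swedish-cased" is assumed to be the published Hub name, and the example sentence is arbitrary.

```python
from transformers import pipeline

# Assumed Hub identifier for KBLab's Swedish BERT.
fill = pipeline("fill-mask", model="KB/bert-base-swedish-cased")

# Ask the masked language model to complete the sentence.
for prediction in fill("Stockholm är Sveriges [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```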

Other resources

...

Implementations

This recipe shows how the Swedish BERT model has been used for a NER use case.
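
For orientation only, a minimal sketch of such a NER pipeline is given below; it is not taken from the recipe itself, and the fine-tuned checkpoint name "KB/bert-base-swedish-cased-ner" is an assumption to be verified on the Hugging Face Hub.

```python
from transformers import pipeline

# Assumed NER checkpoint fine-tuned from the Swedish BERT model.
ner = pipeline(
    "ner",
    model="KB/bert-base-swedish-cased-ner",
    aggregation_strategy="simple",  # merge word pieces into whole entities
)

for entity in ner("Kungliga biblioteket ligger i Stockholm."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```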