
BERT Language Models



Training a transformer-based model for a specific language

Keywords: BERT

Approaches: Transformers

Tools: Tensor Processing Unit (TPU), Hugging Face Transformers, PyTorch


BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based machine learning technique for natural language processing (NLP), developed by Google (2018-). BERT is an unsupervised, bidirectional language representation, pre-trained on large plain-text corpora; being bidirectional, it takes into account the context of each occurrence of a given word.

Goals

BERT language models can be applied to natural language understanding tasks. The typical end use case of these libraries is fine-tuning a pre-trained language model on a specific downstream task such as summarizing large documents, text classification, or named entity recognition.
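
The following is a minimal sketch (not the working group's own code) of what fine-tuning a pre-trained BERT checkpoint for text classification looks like with Hugging Face Transformers and PyTorch. The checkpoint name, texts and labels are placeholders for illustration only.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Any pretrained BERT checkpoint can be used here (placeholder choice).
model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy training batch; in practice this comes from a labelled corpus.
texts = ["A national library digitises newspapers.", "The weather is nice today."]
labels = torch.tensor([1, 0])  # hypothetical binary labels

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
outputs = model(**batch, labels=labels)  # forward pass returns the loss
outputs.loss.backward()                  # backpropagate
optimizer.step()                         # a single optimisation step
```

In practice the same pattern is repeated over many batches and epochs, usually via the Trainer API or a custom training loop.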

Educational resources

National Library of Norway

The National Library of Norway released the NB-BERT-Base model. It is based on the same architecture as the multilingual cased BERT model and is trained on a wide variety of Norwegian text (both Bokmål and Nynorsk) from the last 200 years. The model has been evaluated on NER and POS tasks, and it is expected to perform well on other tasks too.
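
A minimal sketch of loading the model from the Hugging Face Hub and extracting contextual embeddings is shown below. The identifier "NbAiLab/nb-bert-base" is an assumption about the published name; check the Hub for the exact id.

```python
from transformers import AutoTokenizer, AutoModel

# Assumed Hub identifier for NB-BERT-Base.
tokenizer = AutoTokenizer.from_pretrained("NbAiLab/nb-bert-base")
model = AutoModel.from_pretrained("NbAiLab/nb-bert-base")

inputs = tokenizer("Nasjonalbiblioteket ligger i Oslo.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden size)
```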

National Library of Sweden

The National Library of Sweden / KBLab released three pretrained language models based on BERT and ALBERT. The models are trained on approximately 15-20 GB of text (200M sentences, 3,000M tokens) from various sources (books, news, government publications, Swedish Wikipedia and internet forums), aiming to provide a representative BERT model for Swedish text.

A complete description is available in this paper: "Playing with words at the National Library of Sweden - Making a Swedish BERT".
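
As a quick illustration, the Swedish model can be probed with a fill-mask query through the Transformers pipeline API. This is a hedged sketch: the identifier "KB/bert-base-swedish-cased" is assumed to be the published Hub name, and the example sentence is arbitrary.

```python
from transformers import pipeline

# Assumed Hub identifier for KBLab's Swedish BERT.
fill = pipeline("fill-mask", model="KB/bert-base-swedish-cased")

# Ask the masked language model to complete the sentence.
for prediction in fill("Stockholm är Sveriges [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```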

Other resources

...

Implementations

This recipe shows how the Swedish BERT model has been used for a NER use case.
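
For orientation only, a minimal sketch of such a NER pipeline is given below; it is not taken from the recipe itself, and the fine-tuned checkpoint name "KB/bert-base-swedish-cased-ner" is an assumption to be verified on the Hugging Face Hub.

```python
from transformers import pipeline

# Assumed NER checkpoint fine-tuned from the Swedish BERT model.
ner = pipeline(
    "ner",
    model="KB/bert-base-swedish-cased-ner",
    aggregation_strategy="simple",  # merge word pieces into whole entities
)

for entity in ner("Kungliga biblioteket ligger i Stockholm."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```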