BERT Language Models
Training a transformer-based model for a specific language
Keywords: BERT
Approaches: Transformers
Tools: Tensor Processing Unit (TPU), Hugging Face Transformers, PyTorch
BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based machine learning technique for natural language processing (NLP), developed by Google (2018-). BERT is a bidirectional, unsupervised language representation (it takes the context of each occurrence of a given word into account), pre-trained on large plain-text corpora.
BERT language models can be applied to natural language understanding tasks. The typical end use case of these libraries is fine-tuning the language models on specific tasks such as summarizing large documents, classification, named entity recognition, etc.
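A minimal sketch of this fine-tuning workflow with Hugging Face Transformers and PyTorch is shown below. The checkpoint name, label count, and example sentences are placeholders for illustration, not values taken from this Recipe.
```python
# Minimal sketch: load a pretrained BERT checkpoint and prepare it for
# fine-tuning on a classification task with Hugging Face Transformers + PyTorch.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-multilingual-cased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize a toy batch and run a forward pass with labels to get a training loss.
batch = tokenizer(["an example sentence", "another sentence"],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([0, 1])
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # gradients for one optimizer step
print(float(outputs.loss))
```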
The National Library of Norway released the NB-BERT-Base model. It is based on the same architecture as the multilingual cased BERT model and is trained on a wide variety of Norwegian text (both Bokmål and Nynorsk) from the last 200 years. The model has been evaluated on NER/POS tasks, but it is expected to perform well on other tasks as well. A loading sketch follows the links below.
- GitHub
- Presentation of the Colossal Norwegian Corpus
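The sketch below shows one way to load NB-BERT-Base through the Transformers fill-mask pipeline; the Hub id NbAiLab/nb-bert-base and the example sentence are assumptions for illustration, not details confirmed in this Recipe.
```python
# Sketch of loading NB-BERT-Base with the fill-mask pipeline. The Hub id
# "NbAiLab/nb-bert-base" is an assumed location for the published checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="NbAiLab/nb-bert-base")
# The model predicts the [MASK] token in a Norwegian sentence.
for prediction in fill_mask("Nasjonalbiblioteket ligger i [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```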
The National Library of Sweden/KBLab released three pretrained language models based on BERT and ALBERT. The models are trained on approximately 15-20 GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, Swedish Wikipedia and internet forums), aiming to provide a representative BERT model for Swedish text.
A complete description is available in this paper: "Playing with words at the National Library of Sweden - Making a Swedish BERT".
...
This Recipe shows how the Swedish BERT model has been used for a NER use case.
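As a rough illustration, the sketch below runs Swedish NER through the Transformers token-classification pipeline; the Hub id KB/bert-base-swedish-cased-ner and the example sentence are assumptions, not details taken from this Recipe.
```python
# Sketch of a NER use case with a Swedish BERT checkpoint. The Hub id
# "KB/bert-base-swedish-cased-ner" is an assumed location for the model.
from transformers import pipeline

ner = pipeline("token-classification",
               model="KB/bert-base-swedish-cased-ner",
               aggregation_strategy="simple")  # merge word pieces into entities
for entity in ner("Kungliga biblioteket ligger i Stockholm."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```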