Several months ago, we released our BERT model pre-trained with our own web corpus. This time we decided to release a DistilBERT model. According to the paper of DistilBERT, the DistilBERT model can reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. Therefore, to show the above advantages and the quality of our model, we list the performance evaluation of two down-stream tasks as well as the inference time in sections below. It turned out that for a classification task, our DistilBERT model retains 98% of the BERT's performance, while saving 58% of the inference time. For another question answering task, the DistilBERT model retains 90% of the BERT's performance, while saving 47% of the inference time.
Although it is possible to distill from the BERT model we released, we chose to pre-train a new BERT base model and use the new model as the teacher model to train our DistilBERT mdoel. The main difference between the released BERT and the new BERT model is the corpus they are pre-trained on. Hoping to improve the performance, this time we used a Japanese corpus extracted from Common Crawl (CC) data by using Extractor, and then cleaned the data in the same fashion Google prepared the C4 corpus.
Download the base DistilBERT model from here.
The DistilBERT model has been evaluated on two tasks, Livedoor news classification task and driving-domain question answering (DDQA) task. In Livedoor news classification, each piece of news is supposed to be classified into one of nine categories. In DDQA task, given question-article pairs, answers to the questions are expected to be found from the articles. The results of the evaluation are shown below, in comparison with a baseline model trained by Bandai Namco, whose teacher model is the pretrained Japanese BERT models from TOHOKU NLP LAB. Note that due to the small size of the evaluation datasets, the results may differ a little every time.
For Livedoow news classification task:
model name | model size | corpus | corpus size | eval evironment | batch size | epoch | learning rate | accuracy (%) | performance retaining rate |
---|---|---|---|---|---|---|---|---|---|
Laboro-BERT-teacher | Base | clean CC corpus | 13G | GPU | 4 | 10 | 5e-5 | 97.90 | - |
Laboro-DistilBERT | Base | clean CC corpus | 13G | GPU | 4 | 10 | 5e-5 | 95.86 | 98% |
BandaiNamco-DistilBERT | Base | JA-Wikipedia | 2.9G | GPU | 4 | 10 | 5e-5 | 94.03 | - |
For Driving-domain QA task:
model name | model size | corpus | corpus size | eval evironment | batch size | epoch | learning rate | accuracy (%) | performance retaining rate |
---|---|---|---|---|---|---|---|---|---|
Laboro-BERT-teacher | Base | clean CC corpus | 13G | TPU | 32 | 3 | 5e-5 | 75.21 | - |
Laboro-DistilBERT | Base | clean CC corpus | 13G | GPU | 32 | 3 | 5e-5 | 67.92 | 90% |
BandaiNamco-DistilBERT | Base | JA-Wikipedia | 2.9G | GPU | 32 | 3 | 5e-5 | 52.60 | - |
We also tested the inference time of the two tasks. Because the inference time is influenced by the status of the GPU or TPU and is not very stable, we calculated the average time over 5 times of measurement.
model name | task | eval evironment | number of examples | ave inf. time (seconds) | time saving rate |
---|---|---|---|---|---|
Laboro-BERT-teacher | Livedoor | Tesla T4 GPU | 1473 | 65.4 | - |
Laboro-DistilBERT | Livedoor | Tesla T4 GPU | 1473 | 27.2 | 58% |
BandaiNamco-DistilBERT | Livedoor | Tesla T4 GPU | 1473 | 28 | - |
Laboro-BERT-teacher | DDQA | TPU | 1042 | 39.2 | - |
Laboro-DistilBERT | DDQA | Tesla T4 GPU | 1042 | 20.8 | 47% |
BandaiNamco-DistilBERT | DDQA | Tesla T4 GPU | 1042 | 20.8 | - |
We haven't published any paper on this work. Please cite this repository:
@article{Laboro DistilBERT Japanese,
title = {Laboro DistilBERT Japanese},
author = {"Zhao, Xinyi and Hamamoto, Masafumi and Fujihara, Hiromasa"},
year = {2020},
howpublished = {\url{https://github.com/laboroai/Laboro-DistilBERT-Japanese}}
}
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
For commercial use, please contact Laboro.AI Inc.
- To load the pre-trained DistilBERT model, run
from transformers import DistilBertModel
# DistilBertModel can be replaced by AutoModel
pt_model = DistilBertModel.from_pretrained('laboro-ai/distilbert-base-japanese')
- To load the SentencePiece tokenizer, run
from transformers import AlbertTokenizer
# AlbertTokenizer can't be replaced by AutoTokenizer
sp_tokenizer = AlbertTokenizer.from_pretrained('laboro-ai/distilbert-base-japanese')
- To load the DistilBERT model fine-tuned on Livedoor News corpus, run
from transformers import DistilBertForSequenceClassification
model = DistilBertForSequenceClassification.from_pretrained('laboro-ai/distilbert-base-japanese-finetuned-livedoor')
- To load the DistilBERT model fine-tuned on DDQA corpus, run
from transformers import DistilBertForQuestionAnswering
model = DistilBertForQuestionAnswering.from_pretrained('laboro-ai/distilbert-base-japanese-finetuned-ddqa')
* Note that the model on Huggingface Model Hub is also released under CC-BY-NC-4.0 license.
To fine-tune our DistilBERT model, download the model and tokenizer from here, and then put everything in ./model/laboro_distilbert/
directory.
Similarly, to fine-tune the model trained by Bandai Namco, follow their instruction here to download the model and vocab file, and then put everything in ./model/namco_distilbert/
directory.
Text classification means assigning labels to text. Because the labels can be defined to describe any aspect of the text, text classification has a wide range of application. The most straightforward one would be categorizing the topic or sentiment of the text. Besides those, other examples include recognizing spam email, judging whether two sentences have same or similar meaning.
In the evaluation of English language models in classification task, several datasets (e.g. SST-2, MRPC) can be used as common benchmarks. As for Japanese DistilBERT model, Livedoor news corpus can be used in the same fashion. Each piece of news in this corpus can be classified into one of the nine categories.
The original corpus is not devided in training, evaluation, and testing data. In this repository, we provide index lists of the dataset used in our experiments. Our dataset is pre-processed based on Livedoor News Corpus in following steps:
- concatenating all of the data
- shuffling randomly
- deviding into train:dev:test = 2:2:6
- Python 3.6
- Transformers==3.4.0
- torch==1.7.0
- sentencepiece>=0.1.85
- GPU is recommended
- (for Bandai Namco model) mecab-python3 with MeCab and its dictionary mecab-ipadic-2.7.0-20070801 installed
The Livedoor News corpus can be downloaded from here.
To use the same dataset splits as we used in our experiments, follow the instruction here.
The dataset should be put in ./data/livedoor/
directory.
-
Our DistilBERT
SentencePiece tokenizer should be used.
To pre-tokenize Livedoor dataset, use
./src/data_prep/sentencepiece_tokenize_livedoor.ipynb
. -
Bandai Namco DistilBERT
First follow
./src/data_prep/gen_vocab_lower.ipynb
to generate lowercase vocab file for later use.Mecab tokenizer with mecab-ipadic-NEologd dictionary should be used first, then transformers WordPiece tokenizer.
To pre-tokenize Livedoor dataset, use
./src/data_prep/mecab_wordpiece_tokenize_livedoor.ipynb
.
-
Our DistilBERT
Rename the file
config_livedoor.json
asconfig.json
, and don't forget to backup the originalconfig.json
.To fine-tune the model for Livedoor task, use
./src/laboro_distilbert/finetune-livedoor.ipynb
. -
Bandai Namco DistilBERT
Rename the file
config_livedoor.json
asconfig.json
, and don't forget to backup the originalconfig.json
.To fine-tune the model for Livedoor task, use
./src/namco_distilbert/finetune-livedoor.ipynb
.
Question answering is another task to evaluate and apply language models. In English NLP, SQuAD is one the of most widely used datasets for this task. In SQuAD, questions and corresponding Wikipedia pages are given, and the answers to the questions are supposed to be found from the Wikipedia pages.
For QA task, we used Driving Domain QA dataset for evaluation. This dataset consists of PAS-QA dataset and RC-QA dataset. So far, we have only evaluated our model on the RC-QA dataset. The dataset is already in the format of SQuAD 2.0, so no pre-processing is needed for further use.
- Python 3.6
- Transformers==3.4.0
- torch==1.7.0
- sentencepiece>=0.1.85
- GPU is recommended
- (for Bandai Namco model) mecab-python3 with MeCab and its dictionary mecab-ipadic-2.7.0-20070801 installed
Download DDQA dataset following the instruction here.
Since we only use the data in RC-QA
dataset, please copy the dataset into ./data/ddqa/RC-QA/
directory.
-
Our DistilBERT
To pre-tokenize DDQA dataset, use
./src/data_prep/sentencepiece_tokenize_ddqa.ipynb
. -
Bandai Namco DistilBERT
First follow
./src/data_prep/gen_vocab_lower.ipynb
to generate lowercase vocab file for later use.To pre-tokenize DDQA dataset, use
./src/data_prep/mecab_wordpiece_tokenize_ddqa.ipynb
.
-
Our DistilBERT
Rename the file
config_ddqa.json
asconfig.json
, and don't forget to backup the originalconfig.json
.To fine-tune the model for DDQA task, use
./laboro_distilbert/finetune-ddqa.ipynb
. -
Bandai Namco DistilBERT
Rename the file
config_ddqa.json
asconfig.json
, and don't forget to backup the originalconfig.json
.To fine-tune the model for DDQA task, use
./namco_distilbert/finetune-ddqa.ipynb
.