Bert-Large Model Training using Intel® Extension for TensorFlow.
Use Case | Framework | Model Repo | Branch/Commit/Tag | Optional Patch |
---|---|---|---|---|
training | TensorFlow | NVIDIA/DeepLearningExamples | master | bert-large-itex-bf16.patch |
Note: Refer to CONTAINER.md for BERT-Large training instructions using docker containers.
-
Host has Intel® Data Center GPU Max.
-
Host has installed latest Intel® Data Center GPU Max Series Driver https://dgpu-docs.intel.com/driver/installation.html
-
The following Intel® oneAPI Base Toolkit components are required:
- Intel® oneAPI DPC++ Compiler (Placeholder DPCPPROOT as its installation path)
- Intel® oneAPI Math Kernel Library (oneMKL) (Placeholder MKLROOT as its installation path)
Recommend to use a package manager like apt, yum or dnf to install the packages above. Follow instructions at Intel® oneAPI Base Toolkit Download page to setup the package manager repository.
-
Prapare total dataset from scratch. NV-bert repository provides scripts to download, verify, and extract the SQuAD dataset and pretrained weights for fine-tuning as well as Wikipedia and BookCorpus dataset for pre-training.
-
You can run below scripts to download datasets for fine-tuning and pretraining. Remember to modify the environment varible
BERT_PREP_WORKING_DIR
indata/create_datasets_from_start.sh
to your real data dir.git clone https://github.com/NVIDIA/DeepLearningExamples.git cd DeepLearningExamples/TensorFlow2/LanguageModeling/BERT bash data/create_datasets_from_start.sh pretrained
-
Then use below scripts and patch to create small dataset for bert large training performance.
git clone https://github.com/IntelAI/models.git cp models/models_v2/tensorflow/bert_large/training/gpu/bert-large-itex-bf16.patch . git am --signoff < bert-large-itex-bf16.patch bash data/create_small_dataset_for_benchmark.sh [dataset_output_dir](default: ${PWD}/dataset)
The dataset structure will be:
- [dataset_output_dir]
- training/
- test/
- uncased_L-24_H-1024_A-16/
-
If not already cloned, clone Intel® AI Reference Model repository
git clone https://github.com/IntelAI/models.git
-
cd models/models_v2/tensorflow/bert_large/training/gpu
-
Create virtual environment
venv
and activate it:python3 -m venv venv . ./venv/bin/activate
-
Run setup.sh
./setup.sh
-
Install tensorflow and ITEX
-
Setup required environment paramaters
Parameter export command DATA_DIR export DATA_DIR=/the/path/to/dataset
RESULTS_DIR export RESULTS_DIR=/the/path/to/result_dir
DATATYPE export DATATYPE=bf16
(bf16, tf32 or fp32)MULTI_TILE export MULTI_TILE=False
(provide True for multi-tile GPU such as Max 1550, and False for single-tile GPU such as Max 1100)NUM_DEVICES export NUM_DEVICES=<num_devices>
(<num_devices>
is the number of GPU devices to use for training. It must be equal to or smaller than the number of GPU devices attached to each node. For GPU with 2 tiles, such as Max 1550 GPU, the number of GPU devices in each node is 2 times the number of GPUs, so<num_devices>
can be set as <=16 for a node with 8 Max 1550 GPUs. While for GPU with single tile, such as Max 1100 GPU, the number of GPU devices available in each node is the same as number of GPUs, so<num_devices>
can be set as <=8 for a node with 8 Max 1100 GPUs.) -
Run
run_model.sh
Single-device output will typically look like:
I0907 03:06:10.519348 140707593221952 model_training_utils.py:83] Training Summary:
{'total_training_steps': 8, 'train_loss': 10.389582633972168}
I0907 03:06:10.519572 140707593221952 model_training_utils.py:595] -----------------------------
I0907 03:06:10.519612 140707593221952 model_training_utils.py:596] Batch size = 32
I0907 03:06:10.519635 140707593221952 model_training_utils.py:597] Num steps = 8
I0907 03:06:10.519656 140707593221952 model_training_utils.py:598] LR = 0.0005
I0907 03:06:10.519677 140707593221952 model_training_utils.py:602] Total Training Time = 339.25 for Sequences = 7680
I0907 03:06:10.519717 140707593221952 model_training_utils.py:604] Throughput Average (sequences/sec) with overhead = 22.64
I0907 03:06:10.520012 140707593221952 model_training_utils.py:606] Throughput Average (sequences/sec) = 99.05
I0907 03:06:10.520030 140707593221952 model_training_utils.py:607] -----------------------------
decayed_learning_rate_at_crossover_point = 5.000000e-04, adjusted_init_lr = 5.000000e-04
DLL 2023-09-07 03:06:10.520066 - throughput_train : 99.048 sequences/s
DLL 2023-09-07 03:06:10.520158 - total_loss : 10.3896
Multi-device output will typically look like:
I1030 10:59:04.732321 139681571383104 model_training_utils.py:83] Training Summary:
{'total_training_steps': 8, 'train_loss': 10.069790840148926}
I1030 10:59:04.732495 139681571383104 model_training_utils.py:595] -----------------------------
I1030 10:59:04.732545 139681571383104 model_training_utils.py:596] Batch size = 32
I1030 10:59:04.732575 139681571383104 model_training_utils.py:597] Num steps = 8
I1030 10:59:04.732604 139681571383104 model_training_utils.py:598] LR = 0.0005
I1030 10:59:04.732630 139681571383104 model_training_utils.py:600] Multi-GPU training with TF Horovod
I1030 10:59:04.732667 139681571383104 model_training_utils.py:601] hvd.size() = 2
I1030 10:59:04.732693 139681571383104 model_training_utils.py:602] Total Training Time = 279.63 for Sequences = 15360
I1030 10:59:04.732739 139681571383104 model_training_utils.py:604] Throughput Average (sequences/sec) with overhead = 54.93
I1030 10:59:04.732893 139681571383104 model_training_utils.py:606] Throughput Average (sequences/sec) = 197.41
I1030 10:59:04.732919 139681571383104 model_training_utils.py:607] -----------------------------
decayed_learning_rate_at_crossover_point = 1.000000e-03, adjusted_init_lr = 1.000000e-03
DLL 2023-10-30 10:59:04.732944 - throughput_train : 197.408 sequences/s
DLL 2023-10-30 10:59:04.733007 - total_loss : 10.0698
2023:10:30-10:59:08:(273540) |CCL_INFO| finalizing level-zero
2023:10:30-10:59:08:(273540) |CCL_INFO| finalized level-zero
Final results of the training can be found in results.yaml
file.
results:
- key: throughput
value: 99.048
unit: sequences/s