MLCube integration with Bert #632

davidjurado · 2023-03-30T15:27:54Z

MLCube for Bert

MLCube™ GitHub repository. MLCube™ wiki.

Project setup

An important requirement is that you must have Docker installed.
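A quick preflight check can catch a missing Docker installation before any task is launched. The sketch below is only an illustration: `require_cmds` is a hypothetical helper name, not part of MLCube, and the list of commands to check is an assumption based on the setup commands in this section.

```shell
# Hypothetical helper (not part of MLCube): succeeds only if every
# command named in its arguments can be found on PATH.
require_cmds() {
  local cmd
  local missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing required command: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage (assumed tool list, adjust as needed):
#   require_cmds docker git python3 || exit 1
#   docker info >/dev/null || echo "Docker is installed but the daemon is not reachable" >&2
```

The function only reports what is missing; whether to abort is left to the caller.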

# Create Python environment and install MLCube Docker runner 
virtualenv -p python3 ./env && source ./env/bin/activate && pip install pip==24.0 && pip install mlcube-docker
# Fetch the implementation from GitHub
git clone https://github.com/mlcommons/training && cd ./training/language_model/tensorflow/bert
git fetch origin pull/632/head:mlcube_bert && git checkout mlcube_bert

Go to the mlcube directory and list the tasks that MLCube implements.

cd ./mlcube
mlcube describe

Demo execution

These tasks will use a demo dataset to execute a faster training workload for a quick demo (~8 min):

mlcube run --task=download_demo -Pdocker.build_strategy=always

mlcube run --task=demo -Pdocker.build_strategy=always

It's also possible to execute the two tasks with a single command:

mlcube run --task=download_demo,demo -Pdocker.build_strategy=always

MLCube tasks

Download dataset.

mlcube run --task=download_data -Pdocker.build_strategy=always

Process dataset.

mlcube run --task=process_data -Pdocker.build_strategy=always

Train BERT.

mlcube run --task=train -Pdocker.build_strategy=always

Run compliance checker.

mlcube run --task=check_logs -Pdocker.build_strategy=always

Execute the complete pipeline

You can execute the complete pipeline with one single command.

mlcube run --task=download_data,process_data,train,check_logs -Pdocker.build_strategy=always
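When the task list changes often, it may help to build the comma-separated `--task` value from separate arguments. The following is a minimal sketch under stated assumptions: `join_tasks`, `run_pipeline`, and the `RUN` variable are hypothetical names introduced here for illustration. Only the `mlcube run --task=... -Pdocker.build_strategy=always` form comes from the commands above. By default the helper prints the command instead of running it.

```shell
# Hypothetical helper: join all arguments with commas.
join_tasks() {
  local IFS=,
  printf '%s\n' "$*"
}

# Hypothetical helper: print the mlcube command for the given tasks.
# Set RUN=1 in the environment to execute it instead of printing it.
run_pipeline() {
  local tasks
  tasks="$(join_tasks "$@")"
  set -- mlcube run "--task=${tasks}" -Pdocker.build_strategy=always
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Example usage:
#   run_pipeline download_data process_data train check_logs
#   RUN=1 run_pipeline download_demo demo
```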

TPU Training

To run this benchmark on a TPU you need access to Google Cloud Platform. Create a project (note: all the resources must be created in the same project), then follow the steps below:

Create a TPU node

In the Google Cloud console, search for the Cloud TPU API page, then click Enable.

Then go to the virtual machines section and select TPUs.

Select create TPU node and fill in the required parameters. The readme recommends TPU type v3-128 and TPU software version 2.4.0.

The 3 most important parameters you need to remember are: project name, TPU name, and TPU Zone.

After creating the node, click on the TPU name to see the TPU details, and copy the Service account (it should be in the format: service-xxxxxxxxxxxx@cloud-tpu.iam.gserviceaccount.com).
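Since the Service account is pasted into the bucket permissions later, a small format check can catch copy-and-paste mistakes early. This is a sketch only: `is_tpu_service_account` is a hypothetical helper, and it assumes that the `xxxxxxxxxxxx` part of the format above consists of digits, which this description does not state explicitly.

```shell
# Hypothetical helper: succeeds if the argument looks like
# service-<digits>@cloud-tpu.iam.gserviceaccount.com.
# Assumption: the placeholder "xxxxxxxxxxxx" stands for digits only.
is_tpu_service_account() {
  printf '%s\n' "$1" |
    grep -Eq '^service-[0-9]+@cloud-tpu\.iam\.gserviceaccount\.com$'
}

# Example usage:
#   is_tpu_service_account "$SERVICE_ACCOUNT" || echo "unexpected format" >&2
```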

Create a Google Storage Bucket

Go to Google Storage, create a new bucket, and define the required parameters.

In the bucket list, select the checkbox for the bucket you just created, click on permissions, and then click on add principal.

In the new principals field, paste the Service account copied from the TPU details in the previous section. For the roles, select Storage Legacy Bucket Owner, Storage Legacy Bucket Reader and Storage Legacy Bucket Writer, then click on save. This allows the TPU to save the checkpoints during training.

Create a VM instance

The idea is to create a virtual machine instance containing all the code we will execute using MLCube.

Go to VM instances, then click on create instance and define all the needed parameters (No GPU needed).

IMPORTANT: In the section Identity and API access, check the option Allow full access to all Cloud APIs. This allows the connection between this VM, the Cloud Storage Bucket and the TPU.

Start the VM, connect to it via SSH, then use this tutorial to install Docker.

After installing Docker, clone the repo and install MLCube as described in Project setup, then go to the path: training/language_model/tensorflow/bert/mlcube

There, edit the file workspace/parameters.yaml and replace the following values with your own:

output_gs: your_gs_bucket_name
tpu_name: your_tpu_instance_name
tpu_zone: your_tpu_zone
gcp_project: your_gcp_project
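Launching the TPU task with a placeholder left in place would likely fail late, so a quick check of the file can be useful. The sketch below is an assumption-laden illustration: `check_tpu_params` is a hypothetical helper, and it assumes the four keys appear unindented and unquoted at the top level of the file, exactly as shown above.

```shell
# Hypothetical helper: fails if any of the four TPU keys is missing
# or still starts with the "your_" placeholder prefix.
# Assumption: keys are top-level, unindented and unquoted.
check_tpu_params() {
  local file="$1"
  local key
  local value
  local status=0
  for key in output_gs tpu_name tpu_zone gcp_project; do
    # Extract the text after "key:" from the first matching line.
    value="$(sed -n "s/^${key}:[[:space:]]*//p" "$file" | head -n 1)"
    if [ -z "$value" ] || [ "${value#your_}" != "$value" ]; then
      echo "${file}: set a real value for ${key}" >&2
      status=1
    fi
  done
  return "$status"
}

# Example usage:
#   check_tpu_params workspace/parameters.yaml || exit 1
```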

After that run the command:

mlcube run --task=train_tpu --mlcube=mlcube_tpu.yaml -Pdocker.build_strategy=always

This starts the MLCube task, which runs on the host VM and sends the job and its data to the TPU through gRPC. The TPU then receives the code to execute and the location of the data in the Cloud Storage Bucket, and executes the training workload.

github-actions · 2023-03-30T15:28:12Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

arjunsuresh

FROM tensorflow/tensorflow:2.4.0-gpu

The current code is not working with the latest TensorFlow (2.12), right? When we tried it, only CPU execution happened (no GPU).

davidjurado · 2023-05-10T21:08:41Z

Hello @arjunsuresh

Could you please provide more information about the issue you are facing, maybe a screenshot or the error message you are getting? Thanks!

arjunsuresh · 2023-05-10T21:26:26Z

Hi @davidjurado
Actually I was asking if there is a reason why tensorflow2.4 is used here and not the latest one (2.12)?
https://github.com/mlcommons/training/pull/632/files#diff-5c97df48019afb46e0bdfc9099f9dd798a6448ff672b1a1ec1f664500f8409d9

davidjurado · 2023-05-18T15:34:46Z

Hello @arjunsuresh,

That's because the original readme mentions that the model was tested with this TensorFlow version. I tested the newer versions and they work as expected, so I'm planning to update the Docker base image to match the newest version. Thanks.

nv-rborkar · 2024-03-08T03:34:58Z

@sgpyc as reference owner could you please review this MR so that we can merge it.

nv-rborkar · 2024-04-04T15:49:42Z

@sgpyc has reviewed & looks ok.

davidjurado mentioned this pull request Mar 30, 2023

MLCube packing: Bert benchmark [WIP] #503

Closed

davidjurado force-pushed the mlcube_bert branch from 3958143 to 558db03 Compare April 13, 2023 14:11

davidjurado changed the title ~~[WIP] MLCube integration with Bert~~ MLCube integration with Bert Apr 13, 2023

davidjurado requested a review from a team as a code owner April 28, 2023 15:55

arjunsuresh reviewed May 8, 2023

davidjurado force-pushed the mlcube_bert branch from d10b209 to 7a5eb75 Compare May 18, 2023 15:41

davidjurado force-pushed the mlcube_bert branch from 53ae9e1 to 21d15d8 Compare June 8, 2023 20:04

davidjurado force-pushed the mlcube_bert branch 2 times, most recently from 23c1d0a to 57ab273 Compare September 22, 2023 16:07

nv-rborkar previously approved these changes Apr 4, 2024

davidjurado added 12 commits May 10, 2024 07:44

Add MLCube integration with Bert

396ecf0

Fix training and check logs scripts

049f74f

Fix evaluation in MLCube

9275951

Fix merge

5194bc6

Add TPU logic for MLCube

ad91b9a

temp email

83ce9b0

Fix file logic

2ca5aed

Update TPU implementation

174d0d4

Fix TPU implementation

b8dbb7c

Fix create pretraining data script

b51781a

Fix training script for gpu

478cbfb

Add small demo

3cedb4b

davidjurado dismissed nv-rborkar’s stale review via 3cedb4b May 10, 2024 15:30

davidjurado force-pushed the mlcube_bert branch from 57ab273 to 3cedb4b Compare May 10, 2024 15:30

update demo download link

3283fc3
