MLCube integration with Bert #632
base: master
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
FROM tensorflow/tensorflow:2.4.0-gpu
The current code is not working with the latest TensorFlow (2.12), right? When we tried it, only CPU execution happened (no GPU).
Hello @arjunsuresh, could you please provide more information about the issue you are facing, maybe a screenshot or the error message you are getting? Thanks!
Hi @davidjurado
Hello @arjunsuresh, that's because in the original README they mentioned that the model was tested using this TensorFlow version. I tested the newer versions and it works as expected; I'm planning to update the Docker base image to match the newest version. Thanks.
@sgpyc, as reference owner, could you please review this MR so that we can merge it?
@sgpyc has reviewed and it looks OK.
MLCube for Bert
MLCube™ GitHub repository. MLCube™ wiki.
Project setup
An important requirement is that you must have Docker installed.
Go to the mlcube directory and study what tasks MLCube implements.

```
cd ./mlcube
mlcube describe
```
Demo execution
These tasks will use a demo dataset to execute a faster training workload for a quick demo (~8 min):
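A minimal sketch of the two demo commands, assuming the task names download_demo and demo used by other MLCommons MLCube integrations; confirm the actual names with mlcube describe:

```
# Download the demo dataset (task name is an assumption).
mlcube run --task=download_demo -Pdocker.build_strategy=always

# Run the short demo training workload (task name is an assumption).
mlcube run --task=demo -Pdocker.build_strategy=always
```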
It's also possible to execute the two tasks in one single instruction:
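For example, chaining the same assumed task names with a comma-separated list:

```
mlcube run --task=download_demo,demo -Pdocker.build_strategy=always
```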
MLCube tasks
Download dataset.
Process dataset.
Train BERT.
Run compliance checker.
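Hedged command sketches for each task, assuming the task names download_data, process_data, train, and check_logs; the authoritative list comes from mlcube describe:

```
# Download the dataset (task name is an assumption).
mlcube run --task=download_data -Pdocker.build_strategy=always

# Preprocess the dataset (task name is an assumption).
mlcube run --task=process_data -Pdocker.build_strategy=always

# Train BERT (task name is an assumption).
mlcube run --task=train -Pdocker.build_strategy=always

# Run the MLPerf compliance checker on the training logs (task name is an assumption).
mlcube run --task=check_logs -Pdocker.build_strategy=always
```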
Execute the complete pipeline
You can execute the complete pipeline with a single command.
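For example, chaining the assumed task names from the list above:

```
mlcube run --task=download_data,process_data,train,check_logs -Pdocker.build_strategy=always
```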
TPU Training
To execute this benchmark using TPUs you will need access to Google Cloud Platform. First create a project (note: all the resources should be created in the same project), then follow these steps:
In the Google Cloud console, search for the Cloud TPU API page, then click Enable.
Then go to the virtual machine section and select TPUs.
Select create TPU node and fill in all the needed parameters; the recommended TPU type in the README is v3-128 and the recommended TPU software version is 2.4.0.
The 3 most important parameters you need to remember are: project name, TPU name, and TPU zone. After creating, click on the TPU name to see the TPU details, and copy the Service account (it should be in the format: service-xxxxxxxxxxxx@cloud-tpu.iam.gserviceaccount.com).
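If you prefer the CLI over the console, a sketch of the equivalent TPU node creation, assuming the legacy TPU Node API and a placeholder name and zone:

```
# Create a v3-128 TPU node with TPU software 2.4.0 (name and zone are placeholders).
gcloud compute tpus create my-bert-tpu \
    --zone=us-central1-a \
    --accelerator-type=v3-128 \
    --version=2.4.0
```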
Go to Google Storage and create a new bucket, defining the needed parameters.
In the bucket list, select the checkbox for the bucket you just created, then click on Permissions, and after that click on Add principal.
In the new principals field paste the Service account from step 1, and for the roles select Storage Legacy Bucket Owner, Storage Legacy Bucket Reader, and Storage Legacy Bucket Writer. Then click Save; this will allow the TPU to save the checkpoints during training.
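The same grant can be made from the CLI; a sketch assuming a placeholder bucket name and the service account copied earlier:

```
# Create the bucket (name is a placeholder).
gsutil mb gs://my-bert-bucket

# Grant the TPU service account the three legacy bucket roles.
gsutil iam ch \
    serviceAccount:service-xxxxxxxxxxxx@cloud-tpu.iam.gserviceaccount.com:legacyBucketOwner \
    serviceAccount:service-xxxxxxxxxxxx@cloud-tpu.iam.gserviceaccount.com:legacyBucketReader \
    serviceAccount:service-xxxxxxxxxxxx@cloud-tpu.iam.gserviceaccount.com:legacyBucketWriter \
    gs://my-bert-bucket
```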
The idea is to create a virtual machine instance containing all the code we will execute using MLCube.
Go to VM instances, then click on create instance and define all the needed parameters (No GPU needed).
IMPORTANT: In the section Identity and API access, check the option Allow full access to all Cloud APIs; this allows the connection between this VM, the Cloud Storage bucket, and the TPU.
Start the VM, connect to it via SSH, then use this tutorial to install Docker.
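A CLI sketch of the same VM creation with placeholder values; the cloud-platform scope corresponds to the Allow full access to all Cloud APIs option:

```
# Create the host VM (name, zone, and machine type are placeholders; no GPU needed).
gcloud compute instances create my-bert-vm \
    --zone=us-central1-a \
    --machine-type=n1-standard-8 \
    --scopes=cloud-platform
```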
After installing Docker, clone the repo and follow the instructions to install MLCube, then go to the path:
training/language_model/tensorflow/bert/mlcube
There, modify the file at workspace/parameters.yaml and replace its values with your data (project name, TPU name, and TPU zone). After that, run the command:
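A sketch of what that step could look like, assuming a hypothetical TPU training task named train_tpu and hypothetical parameter keys (the real keys live in workspace/parameters.yaml):

```
# Hypothetical parameters.yaml contents; all key names and values are assumptions.
cat > workspace/parameters.yaml <<'EOF'
gcp_project: my-gcp-project      # project name (placeholder)
tpu_name: my-bert-tpu            # TPU name (placeholder)
tpu_zone: us-central1-a          # TPU zone (placeholder)
gcs_bucket: gs://my-bert-bucket  # bucket for checkpoints (placeholder)
EOF

# Launch TPU training (task name train_tpu is an assumption).
mlcube run --task=train_tpu -Pdocker.build_strategy=always
```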
This will start the MLCube task; internally, the host VM sends all the data to the TPU through gRPC, then the TPU gets the code to execute along with the Cloud Storage bucket information and runs the training workload.