Before launching training job, please copy the raw data and define the environmental variables (the bucket for staging and the bucket where you are going to store data as well as training job's outputs)
export BUCKET=YOUR_BUCKET gsutil -m cp gs://cloud-training-demos/babyweight/preproc/* gs://$BUCKET/babyweight/preproc/ export BUCKET_STAGING=YOUR_STAGING_BUCKET
You also need to have bazel installed.
The code below is based on this codelab (you can find more here).
You can dump profiles for a every n-th step. We are going to demonstrate how to collect dumps both in distributed mode as well as when training is don on a single machine (including training with a GPU accelerator). After you've trained a model (see examples below), you need to copy the dumps localy in order to inspect them. You can do it as follows:
rm -rf /tmp/profiler
mkdir -p /tmp/profiler
gsutil -m cp -r $OUTDIR/timeline*.json /tmp/profiler
And now you can launch a trace event profiling tool in your Chrome browser (chrome://tracing), load a specific timeline and visually inspect it.
You can launch the job as following:
OUTDIR=gs://$BUCKET/babyweight/hooks_basic
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
--region=us-west1 \
--module-name=trainer-hooks.task \
--package-path=trainer-hooks \
--job-dir=$OUTDIR \
--staging-bucket=gs://$BUCKET_STAGING \
--scale-tier=BASIC \
--runtime-version="1.10" \
-- \
--bucket=$BUCKET/babyweight \
--output_dir=${OUTDIR} \
--eval_int=1200 \
--train_steps=50000
You can launch the job as following:
OUTDIR=gs://$BUCKET/babyweight/hooks_standard
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
--region=us-west1 \
--module-name=trainer-hooks.task \
--package-path=trainer-hooks \
--job-dir=$OUTDIR \
--staging-bucket=gs://$BUCKET_STAGING \
--scale-tier=STANDARD_1 \
--runtime-version="1.10" \
-- \
--bucket=$BUCKET/babyweight \
--output_dir=${OUTDIR} \
--train_steps=50000
OUTDIR=gs://$BUCKET/babyweight/hooks_gpu
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
--region=us-west1 \
--module-name=trainer-hooks.task \
--package-path=trainer-hooks \
--job-dir=$OUTDIR \
--staging-bucket=gs://$BUCKET_STAGING \
--scale-tier=BASIC_GPU \
--runtime-version="1.10" \
-- \
--bucket=$BUCKET/babyweight \
--output_dir=${OUTDIR} \
--batch_size=8192 \
--train_steps=21000
OUTDIR=gs://$BUCKET/babyweight/hooks_basic-ext
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
--region=us-west1 \
--module-name=trainer-hooks-ext.task \
--package-path=trainer-hooks-ext \
--job-dir=$OUTDIR \
--staging-bucket=gs://$BUCKET_STAGING \
--scale-tier=BASIC \
--runtime-version="1.10" \
-- \
--bucket=$BUCKET/babyweight \
--output_dir=${OUTDIR} \
--eval_int=1200 \
--train_steps=15000
We can collect a deep profiling dump that can be later analyzed with a profiling CLI tool or with a profiler-ui as described in a post (ADD LINK HERE). Launch the training job as following:
OUTDIR=gs://$BUCKET/babyweight/profiler_standard
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
--region=us-west1 \
--module-name=trainer-deep-profiling.task \
--package-path=trainer-deep-profiling \
--job-dir=$OUTDIR \
--staging-bucket=gs://$BUCKET_STAGING \
--scale-tier=STANDARD_1 \
--runtime-version="1.10" \
-- \
--bucket=$BUCKET/babyweight \
--output_dir=${OUTDIR} \
--train_steps=100000
- In order to use profiler CLI, you need to build the profiler first:
git clone https://github.com/tensorflow/tensorflow.git cd tensorflow bazel build -c opt tensorflow/core/profiler:profiler
- Copy dumps locally:
rm -rf /tmp/profiler mkdir -p /tmp/profiler gsutil -m cp -r $OUTDIR/profiler /tmp
- Launch the profiler with
bazel-bin/tensorflow/core/profiler/profiler --profile_path=/tmp/profiler/$(ls /tmp/profiler/ | head -1)
You can also use a profiler-ui, i.e. a web interface for a tensorflow profiler.
- If you'd like to install pprof, please follow the installation instructions.
- Clone the repository:
git clone https://github.com/tensorflow/profiler-ui.git cd profiler-ui
- Copy dumps locally:
rm -rf /tmp/profiler mkdir -p /tmp/profiler gsutil -m cp -r $OUTDIR/$MODEL/profiler /tmp
- Launch the profiler with
python ui.py --profile_context_path=/tmp/profiler/$(ls /tmp/profiler/ | head -1)