Tasks
One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
My own task or dataset (give details below)
Reproduction
We first do TPU training
$ pipenv run accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 3
What is the name of the function in your script that should be launched in all parallel scripts? [main]:
How many TPU cores should be used for distributed training? [1]:1
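For reference, the answers above are written to a YAML config file that the launch commands below pick up. Here is a minimal sketch of printing it; the default cache path and the use of pyyaml are assumptions on my part, not something taken from the run above.

```python
# Hypothetical helper: print the config written by `accelerate config`, so the
# TPU settings used by the runs below are visible. The path is Accelerate's
# usual default location; adjust it if HF_HOME or --config_file was used.
import os
import yaml  # requires pyyaml

config_path = os.path.expanduser("~/.cache/huggingface/accelerate/default_config.yaml")
with open(config_path) as f:
    config = yaml.safe_load(f)

# Expect entries such as distributed_type: TPU and num_processes: 1 here.
for key, value in config.items():
    print(f"{key}: {value}")
```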
$ time pipenv run accelerate launch /tmp/accelerate/examples/nlp_example.py
...a lot of irrelevant warnings...
epoch 0: {'accuracy': 0.7720588235294118, 'f1': 0.8558139534883721}
epoch 1: {'accuracy': 0.8235294117647058, 'f1': 0.8615384615384616}
epoch 2: {'accuracy': 0.8676470588235294, 'f1': 0.9059233449477352}
________________________________________________________
Executed in 150.50 secs fish external
usr time 305.93 secs 0.00 micros 305.93 secs
sys time 36.76 secs 835.00 micros 36.76 secs
We then compare the result with CPU training by passing --cpu. Note that I am still getting the OpKernel warnings, which suggests the TPU libraries are at least being loaded (whether the hardware is actually used is another question; see the sketch after the CPU run below). Anyway, that's a topic for another issue: #451.
$ time pipenv run accelerate launch /tmp/accelerate/examples/nlp_example.py --cpu
...a lot of irrelevant warnings...
2022-06-18 15:33:29.708624: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-06-18 15:33:29.708700: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
epoch 0: {'accuracy': 0.803921568627451, 'f1': 0.8701298701298701}
epoch 1: {'accuracy': 0.8431372549019608, 'f1': 0.8865248226950355}
epoch 2: {'accuracy': 0.8504901960784313, 'f1': 0.8953687821612349}
________________________________________________________
Executed in 372.19 secs fish external
usr time 271.73 mins 504.00 micros 271.73 mins
sys time 9.57 mins 260.00 micros 9.57 mins
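As a side note on the OpKernel question above: a quick way to check whether Accelerate really places the model on the TPU is to print the device it selects. The snippet below is only a sketch (check_device.py is a hypothetical file name, and it assumes the same environment as the runs above); it is not part of nlp_example.py.

```python
# check_device.py -- print which device Accelerate selects under the saved config.
# Launch it the same way as the example:
#   pipenv run accelerate launch check_device.py
import torch
from accelerate import Accelerator

# Pass cpu=True here to mirror what nlp_example.py does when given --cpu.
accelerator = Accelerator(cpu=False)
model = accelerator.prepare(torch.nn.Linear(8, 8))

# On a working TPU setup this should report an XLA device (e.g. xla:0);
# with cpu=True it reports cpu. The parameter dtype shows whether anything
# was silently cast to bf16/fp16.
print("device:", accelerator.device)
print("param dtype:", next(model.parameters()).dtype)
```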
Expected behavior
I expect the result of TPU training to be similar to that of CPU training, but both accuracy and F1 are off by a few percentage points. Maybe it's because they use different floating-point types (e.g. f64/f32/f16/bf16)? If that is the case, I would appreciate a warning like "casting f32 to f16; you may get slightly different results". This is probably related to https://github.com/huggingface/accelerate/issues/450, which reports that single-TPU and multi-TPU training give different results.
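If it helps, here is a hedged sketch of how one might test the floating-point hypothesis: fix the seed and explicitly disable mixed precision on both backends, then compare the metrics again. It loosely mirrors options nlp_example.py already exposes (set_seed and the mixed_precision argument); the exact flag handling in the script may differ.

```python
import os

from accelerate import Accelerator
from accelerate.utils import set_seed

set_seed(42)  # identical seed for both the TPU and the CPU run

# mixed_precision="no" keeps everything in fp32 on the Accelerate side; any
# remaining TPU/CPU gap would then come from XLA/hardware numerics rather
# than an explicit cast.
accelerator = Accelerator(mixed_precision="no")

# torch_xla separately downcasts to bf16 when XLA_USE_BF16=1 is set in the
# environment, so make sure it is unset for an fp32 comparison.
print("XLA_USE_BF16 =", os.environ.get("XLA_USE_BF16"))
print(accelerator.device, accelerator.mixed_precision)
```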