TPU training gives different results than CPU training #452

Closed · 2 of 4 tasks
nalzok opened this issue Jun 18, 2022 · 0 comments · Fixed by #479
Labels: bug (Something isn't working), TPU (Bug or feature on TPU platforms)

nalzok commented Jun 18, 2022

System Info

- `Accelerate` version: 0.10.0
- Platform: Linux-5.13.0-1023-gcp-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.22.4
- PyTorch version (GPU?): 1.11.0a0+gitbc2c6ed (False)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: TPU
        - mixed_precision: no
        - use_cpu: False
        - num_processes: 1
        - machine_rank: 0
        - num_machines: 1
        - main_process_ip: None
        - main_process_port: None
        - main_training_function: main
        - deepspeed_config: {}
        - fsdp_config: {}

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

We first do TPU training:

$ pipenv run accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 3
What is the name of the function in your script that should be launched in all parallel scripts? [main]:
How many TPU cores should be used for distributed training? [1]:1

$ time pipenv run accelerate launch /tmp/accelerate/examples/nlp_example.py
...a lot of irrelevant warnings...
epoch 0: {'accuracy': 0.7720588235294118, 'f1': 0.8558139534883721}
epoch 1: {'accuracy': 0.8235294117647058, 'f1': 0.8615384615384616}
epoch 2: {'accuracy': 0.8676470588235294, 'f1': 0.9059233449477352}

________________________________________________________
Executed in  150.50 secs   fish           external
   usr time  305.93 secs    0.00 micros  305.93 secs
   sys time   36.76 secs  835.00 micros   36.76 secs

We then compare the result with CPU training by passing --cpu. Note that I still get the OpKernel warnings, which suggests the TPU is being used (or perhaps the TPU library is loaded but the hardware itself stays idle; I am not sure). In any case, that is the topic of another issue: #451.

$ time pipenv run accelerate launch /tmp/accelerate/examples/nlp_example.py --cpu
...a lot of irrelevant warnings...
2022-06-18 15:33:29.708624: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-06-18 15:33:29.708700: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
epoch 0: {'accuracy': 0.803921568627451, 'f1': 0.8701298701298701}
epoch 1: {'accuracy': 0.8431372549019608, 'f1': 0.8865248226950355}
epoch 2: {'accuracy': 0.8504901960784313, 'f1': 0.8953687821612349}

________________________________________________________
Executed in  372.19 secs   fish           external
   usr time  271.73 mins  504.00 micros  271.73 mins
   sys time    9.57 mins  260.00 micros    9.57 mins
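
For reference, both runs go through the same nlp_example.py script, so any difference in data shuffling should come from RNG state rather than the code path. A minimal sketch of pinning the seed explicitly before building the model and dataloaders, assuming one edits the script; accelerate.utils.set_seed exists in 0.10.0, and the value 42 is only an illustration:

# Pin every RNG source (Python, NumPy, torch, and the XLA RNG on TPU)
# so that any remaining TPU/CPU gap is due to numerics, not sampling order.
from accelerate.utils import set_seed

set_seed(42)

With the seed fixed on both backends, a remaining gap of a few points would point at numerical differences rather than shuffling.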

Expected behavior

I expect the results of TPU training to be similar to those of CPU training, but both accuracy and F1 are off by a few percentage points. Maybe the two backends are using different floating-point types (e.g. f64/f32/f16/bf16)? If that is the case, I would appreciate a warning like "casting f32 to f16; you may get slightly different results".
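
One way to check the dtype hypothesis would be a small probe run on each backend. A minimal sketch (not taken from the example script) that prints the device and parameter dtype after prepare(), assuming accelerate and torch are installed:

# Probe which device and parameter dtype Accelerate actually ends up with.
# Run once normally (TPU) and once with Accelerator(cpu=True) to mirror --cpu.
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # Accelerator(cpu=True) for the CPU comparison
model = accelerator.prepare(torch.nn.Linear(4, 2))
print(accelerator.device, next(model.parameters()).dtype)

Note that on TPU the parameters may still report float32 while XLA uses bfloat16 internally for matmuls (or everywhere, if XLA_USE_BF16=1 is set), so identical dtypes here would not fully rule out a precision gap.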

This is probably related to https://github.com/huggingface/accelerate/issues/450, which reports that single-TPU and multi-TPU training give different results.
nalzok added the bug (Something isn't working) label Jun 18, 2022
muellerzr added the TPU (Bug or feature on TPU platforms) label Jun 21, 2022