TPU training gives different results than CPU training #452

Closed · 2 of 4 tasks
nalzok opened this issue Jun 18, 2022 · 0 comments · Fixed by #479
Labels: bug (Something isn't working), TPU (Bug or feature on TPU platforms)

nalzok commented Jun 18, 2022

System Info

- `Accelerate` version: 0.10.0
- Platform: Linux-5.13.0-1023-gcp-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.22.4
- PyTorch version (GPU?): 1.11.0a0+gitbc2c6ed (False)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: TPU
        - mixed_precision: no
        - use_cpu: False
        - num_processes: 1
        - machine_rank: 0
        - num_machines: 1
        - main_process_ip: None
        - main_process_port: None
        - main_training_function: main
        - deepspeed_config: {}
        - fsdp_config: {}

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

We first do TPU training:

$ pipenv run accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 3
What is the name of the function in your script that should be launched in all parallel scripts? [main]:
How many TPU cores should be used for distributed training? [1]:1

$ time pipenv run accelerate launch /tmp/accelerate/examples/nlp_example.py
...a lot of irrelevant warnings...
epoch 0: {'accuracy': 0.7720588235294118, 'f1': 0.8558139534883721}
epoch 1: {'accuracy': 0.8235294117647058, 'f1': 0.8615384615384616}
epoch 2: {'accuracy': 0.8676470588235294, 'f1': 0.9059233449477352}

________________________________________________________
Executed in  150.50 secs   fish           external
   usr time  305.93 secs    0.00 micros  305.93 secs
   sys time   36.76 secs  835.00 micros   36.76 secs

We then compare the result with CPU training by passing --cpu. Note that I still get the OpKernel warnings, which suggests the TPU is being used (or perhaps the TPU library is loaded but the hardware itself stays idle; I am not sure). In any case, that is the topic of another issue: #451.

$ time pipenv run accelerate launch /tmp/accelerate/examples/nlp_example.py --cpu
...a lot of irrelevant warnings...
2022-06-18 15:33:29.708624: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-06-18 15:33:29.708700: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
epoch 0: {'accuracy': 0.803921568627451, 'f1': 0.8701298701298701}
epoch 1: {'accuracy': 0.8431372549019608, 'f1': 0.8865248226950355}
epoch 2: {'accuracy': 0.8504901960784313, 'f1': 0.8953687821612349}

________________________________________________________
Executed in  372.19 secs   fish           external
   usr time  271.73 mins  504.00 micros  271.73 mins
   sys time    9.57 mins  260.00 micros    9.57 mins
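
For reference, both runs go through the same nlp_example.py script, so any difference in data shuffling should come from RNG state rather than the code path. A minimal sketch of pinning the seed explicitly before building the model and dataloaders, assuming one edits the script; accelerate.utils.set_seed exists in 0.10.0, and the value 42 is only an illustration:

# Pin every RNG source (Python, NumPy, torch, and the XLA RNG on TPU)
# so that any remaining TPU/CPU gap is due to numerics, not sampling order.
from accelerate.utils import set_seed

set_seed(42)

With the seed fixed on both backends, a remaining gap of a few points would point at numerical differences rather than shuffling.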

Expected behavior

I expect the results of TPU training to be similar to those of CPU training, but both accuracy and F1 are off by a few percentage points. Maybe the two backends are using different floating-point types (e.g. f64/f32/f16/bf16)? If that is the case, I would appreciate a warning like "casting f32 to f16; you may get slightly different results".
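
One way to check the dtype hypothesis would be a small probe run on each backend. A minimal sketch (not taken from the example script) that prints the device and parameter dtype after prepare(), assuming accelerate and torch are installed:

# Probe which device and parameter dtype Accelerate actually ends up with.
# Run once normally (TPU) and once with Accelerator(cpu=True) to mirror --cpu.
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # Accelerator(cpu=True) for the CPU comparison
model = accelerator.prepare(torch.nn.Linear(4, 2))
print(accelerator.device, next(model.parameters()).dtype)

Note that on TPU the parameters may still report float32 while XLA uses bfloat16 internally for matmuls (or everywhere, if XLA_USE_BF16=1 is set), so identical dtypes here would not fully rule out a precision gap.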

This is probably related to https://github.com/huggingface/accelerate/issues/450, which reports that single-TPU and multi-TPU training give different results.
nalzok added the bug (Something isn't working) label Jun 18, 2022
muellerzr added the TPU (Bug or feature on TPU platforms) label Jun 21, 2022