TPU: Crashes using trainer.test() #6230
Labels: accelerator: tpu, bug, help wanted, priority: 0
🐛 Bug
`trainer.test()` does not work with TPUs. There are a few different ways we've seen it crash.
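Both failure modes described in this issue come down to the same pattern: a collective rendezvous guarded by a condition that not every process satisfies. As a rough, pure-Python analogy (using `threading.Barrier` in place of XLA's cross-process rendezvous; `simulate` and `defector` are made-up names for illustration, not part of any real API):

```python
import threading

def simulate(n_procs, defector=None):
    # Stand-in for XLA's rendezvous: every process is expected to call wait().
    barrier = threading.Barrier(n_procs, timeout=0.5)
    timed_out = []

    def proc(rank):
        # A condition that (wrongly) evaluates differently on one process,
        # like a device-type guard flipping to True on some cores only.
        if rank == defector:
            return  # this process skips the rendezvous entirely
        try:
            barrier.wait()
        except threading.BrokenBarrierError:
            timed_out.append(rank)

    threads = [threading.Thread(target=proc, args=(r,)) for r in range(n_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(timed_out)

print(simulate(8))               # every process meets the barrier -> []
print(simulate(8, defector=3))   # the other 7 time out waiting for rank 3
```

Rank 3 stands in for the process whose guard evaluated differently; the remaining seven hit the timeout, analogous to the `RuntimeError` reported below.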
1. Looks like a call to `barrier()` coming from `__test_using_best_weights`

So the barrier is coming from here. It is strange that `barrier` is being called at all; I think this means that `if not self._device_type == DeviceType.TPU` is mistakenly evaluating to `True`. PyTorch Lightning spins up 8 processes for 8 TPU cores, so is it possible that only some of them evaluate to `True`? Basically, it seems that at least one process never reaches this point, which means the other processes wait in the barrier, the meetup never happens, and we get the `RuntimeError` shown.

2. Looks like a call to `xm.save()` being misused

I think the problem is here, with the usage of `xm.save()`. `xm.save()` already handles the multiprocess case by checking the ordinal and only writing to disk if the process is on the master ordinal. In general, if you surround `xm.save()` with `if` statements, some TPU cores will enter the `if` statement and some will not, so the cores that entered it wait for those that didn't; eventually the wait times out and crashes.

Repro methods
1. (Colab) Make 3 modifications to the BoringModel, among them adding `tpu_cores=8` to the trainer cell.
2. (Google Cloud) Use the attached repro.py file in the following way:
```
conda activate torch-xla-1.7
pip install pytorch-lightning==1.2.1
export TPU_IP_ADDRESS=my.tpu.ip.addr
export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
python3 repro.py
```
3. (Your CI setup) Modify the TPU unit tests as follows: add

```
trainer.test(test_dataloaders=DataLoader(RandomDataset(32, 2000), batch_size=32))
```

after some call to `trainer.fit`, then run the tests with

```
coverage run --source=pytorch_lightning -m pytest tests/models/test_tpu.py -v
```

This should allow testing on the CI framework.

Environment
`pip install pytorch-lightning==1.2.1` (note that earlier versions hang due to "Hanging with TPUs on GCE VM" #5841)
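For reference, the behavior this issue attributes to `xm.save()` (every core participates in the rendezvous, and only the master ordinal writes) can be sketched in the same pure-Python analogy. `FakeXmSave` is a hypothetical stand-in for illustration, not the torch_xla API:

```python
import threading

class FakeXmSave:
    """Hypothetical stand-in for xm.save(): every process must call it,
    but only the master ordinal actually writes to disk."""

    def __init__(self, n_procs, master_ordinal=0):
        self._rendezvous = threading.Barrier(n_procs, timeout=0.5)
        self._master = master_ordinal
        self.writes = []

    def save(self, rank, payload):
        if rank == self._master:
            self.writes.append(payload)  # the only real "disk write"
        # Every caller meets here, so no process is left waiting.
        self._rendezvous.wait()

def run(n_procs=8):
    saver = FakeXmSave(n_procs)

    def proc(rank):
        # Correct usage: call save() unconditionally on all processes.
        saver.save(rank, payload="checkpoint")

    threads = [threading.Thread(target=proc, args=(r,)) for r in range(n_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return saver.writes

print(run())  # ['checkpoint'] -- exactly one write, and no hang
```

Wrapping `saver.save(...)` in an `if rank == 0:` guard here would reproduce the bug described above: the guarded caller reaches the rendezvous while the others never do, and the wait times out.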