Early stopping fails on horovod with cannot unpack non-iterable NoneType object #3381

Closed
undertherain opened this issue Sep 7, 2020 · 11 comments · Fixed by #3775
@undertherain

🐛 Bug

When I do early stopping with horovod distributed training, it fails with cannot unpack non-iterable NoneType object in tqdm.
It fails only on some sets of training data. I also see from the logs that early stopping was initiated only three times, while I'm training on 4 workers.
This makes me think the problem is that one of the workers did not initiate early stopping, presumably because each worker decides based on its local validation loss rather than the averaged one.

        result = pl.EvalResult(early_stop_on=loss, checkpoint_on=loss)
        result.log("val_loss", loss, sync_dist=True)

As you can see, I'm asking pytorch-lightning to average the validation loss, but as in my previous issue #3338, the problem seems to be that early stopping uses a different dict.
Here's the full error message:

Epoch 7:   0% 0/8 [00:00<?, ?it/s, loss=0.480, v_num=50]Traceback (most recent call last):
  File "main.py", line 72, in <module>       
    main()
  File "main.py", line 68, in main
    trainer.fit(model, data_module)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1016, in fit
    results = self.accelerator_backend.train()
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/accelerators/horovod_backend.py", line 108, in train
    result = self.trainer.run_pretrain_routine(self.trainer.model)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1237, in run_pretrain_routine
    self.train()
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 396, in train
    self.run_training_epoch()
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 484, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 832, in run_training_batch
    opt_closure_result = self.optimizer_closure(
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 1065, in optimizer_closure
    model_ref.backward(self, closure_loss, optimizer, opt_idx)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/core/hooks.py", line 312, in backward
    loss.backward()
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 125, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
Exception ignored in: <function tqdm.__del__ at 0x2b61156e6820>
Traceback (most recent call last):
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1086, in __del__
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1293, in close
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1471, in display
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1089, in __repr__
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1433, in format_dict
TypeError: cannot unpack non-iterable NoneType object
Without early stopping it works ok. 
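For what it's worth, below is a rough, untested sketch of the kind of manual averaging I would expect to need so that every worker sees the same validation loss (I assumed sync_dist=True already did this). hvd.allreduce averages across ranks by default; MyModel and the mse loss are just placeholders for my actual module and criterion:

import torch.nn.functional as F
import horovod.torch as hvd
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    # ... __init__ / forward / training_step unchanged ...

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = F.mse_loss(self(x), y)  # stand-in for the actual criterion
        # hvd.allreduce averages over all ranks by default, so every worker
        # sees the same value for early stopping and checkpointing
        # (assumes all ranks run the same number of validation batches,
        # otherwise this allreduce can hang)
        loss = hvd.allreduce(loss, name="val_loss")
        result = pl.EvalResult(early_stop_on=loss, checkpoint_on=loss)
        result.log("val_loss", loss)
        return result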

### Environment
  • CUDA:
    • GPU:
      • Tesla V100-SXM2-16GB
      • Tesla V100-SXM2-16GB
      • Tesla V100-SXM2-16GB
      • Tesla V100-SXM2-16GB
    • available: True
    • version: 10.2
  • Packages:
    • numpy: 1.19.1
    • pyTorch_debug: False
    • pyTorch_version: 1.6.0
    • pytorch-lightning: 0.9.1rc1
    • tensorboard: 2.2.0
    • tqdm: 4.46.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.8.2
    • version: #1 SMP Fri Apr 20 16:44:24 UTC 2018
I also had this on lightning 0.9.0; I actually upgraded to the rc hoping it would magically fix the problem.
@undertherain added the bug and help wanted labels Sep 7, 2020
@undertherain
Author

I've printed the logs dict from inside the _validate_condition_metric method:

{'val_early_stop_on': tensor(0.4890, device='cuda:0'), 'val_checkpoint_on': tensor(0.4890, device='cuda:0')}
{'val_early_stop_on': tensor(0.4890, device='cuda:0'), 'val_checkpoint_on': tensor(0.4890, device='cuda:0')}
{'val_early_stop_on': tensor(0.5420, device='cuda:1'), 'val_checkpoint_on': tensor(0.5420, device='cuda:1')}                                 
{'val_early_stop_on': tensor(0.5420, device='cuda:1'), 'val_checkpoint_on': tensor(0.5420, device='cuda:1')}
{'val_early_stop_on': tensor(1.0173, device='cuda:2'), 'val_checkpoint_on': tensor(1.0173, device='cuda:2')}
{'val_early_stop_on': tensor(0.4354, device='cuda:3'), 'val_checkpoint_on': tensor(0.4354, device='cuda:3')}
{'val_early_stop_on': tensor(1.0173, device='cuda:2'), 'val_checkpoint_on': tensor(1.0173, device='cuda:2')}
{'val_early_stop_on': tensor(0.4354, device='cuda:3'), 'val_checkpoint_on': tensor(0.4354, device='cuda:3')}

It indeed looks like every worker is checkpointing on its own loss value.
How come doing this on the averaged loss is not the default behaviour, and how can I enforce that?

It looks like it is called once from early stopping and once from checkpointing, which I did not even ask for.

So far I like lightning, and I would really like to express my appreciation to the developers.
Funnily, unlike other frameworks, where I sometimes struggle to make them do something, with lightning I struggle to make it not do something :D

@edenlightning
Contributor

@tgaddair mind taking a look?

@edenlightning added this to the 0.9.x milestone Sep 8, 2020
@tgaddair
Contributor

tgaddair commented Sep 8, 2020

Sure, let me take a look.

@edenlightning
Contributor

@tgaddair any update?

@tgaddair
Contributor

Hey @edenlightning, not yet. I will try to take a look this week.

@tgaddair
Contributor

Hey @undertherain, sorry for the late response. Tried looking into this earlier but couldn't repro. I suspect there may be a few things going on here:

  1. Early stopping criteria being applied independently on each worker. We should implement PL metrics aggregation for Horovod so the criteria can be applied consistently on every worker.
  2. Something appears to be wrong with the params in the DistributedOptimizer. This sounds related, but it's not clear to me how without being able to repro it myself (see https://discuss.pytorch.org/t/how-to-do-only-one-forward-propagation-per-epoch-and-multiple-backward-propagations-on-graph/65396/9 for more context).

@undertherain, if you can provide a minimal repro, that will help a lot. I will also try to prioritize getting the metrics aggregation for Horovod to work in PL, which may also address this issue as a side effect.
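For (1), the rough idea would be something like the sketch below (just to illustrate the approach, not the actual implementation; reduce_early_stopping_decision is a made-up helper name):

import torch
import horovod.torch as hvd

def reduce_early_stopping_decision(should_stop: bool) -> bool:
    # sum the per-rank decisions: if any rank wants to stop, every rank
    # stops, so no worker is left blocking on a collective op that its
    # peers will never join
    flag = torch.tensor(int(should_stop))
    total = hvd.allreduce(flag, name="early_stop", op=hvd.Sum)
    return bool(total.item() > 0)

Early stopping and checkpointing would then see a consistent decision on every rank.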

@undertherain
Author

Thanks for looking at this!
It took a bit of time to make the code not depend on my custom dataset.
Here I basically generate random tensors and predict 0 on horovod rank 0, and 1 on the rest of the workers:

https://github.com/undertherain/pl_stopping_test/tree/01c8d98ba6136d7a9e1b26f07bb3b8d63eca2724
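The gist of the script is roughly the following (simplified sketch with made-up shapes and a ToyModel name; the repo above has the actual code):

import torch
import torch.nn.functional as F
import horovod.torch as hvd
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return pl.TrainResult(F.mse_loss(self(x), y))

    def validation_step(self, batch, batch_idx):
        x, y = batch
        pred = self(x)
        # rank 0 targets zeros, every other rank targets ones, so the ranks
        # report different val losses and disagree on when to stop early
        target = torch.full_like(pred, float(hvd.rank() != 0))
        loss = F.mse_loss(pred, target)
        result = pl.EvalResult(early_stop_on=loss, checkpoint_on=loss)
        result.log("val_loss", loss, sync_dist=True)
        return result

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

    def _random_loader(self):
        # random inputs and zero targets, regenerated on every worker
        x = torch.randn(64, 8)
        y = torch.zeros(64, 1)
        return DataLoader(TensorDataset(x, y), batch_size=8)

    def train_dataloader(self):
        return self._random_loader()

    def val_dataloader(self):
        return self._random_loader()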

@tgaddair
Contributor

Thanks for putting this together @undertherain! I'll take a look and get back to you.

@tgaddair
Contributor

tgaddair commented Oct 1, 2020

Hey @undertherain, here's the error I'm getting on a machine with 4 GPUs. It looks like one worker triggers early stopping but the others do not, leading to a segfault:

[1,0]<stderr>:Saving latest checkpoint..
[1,0]<stderr>:Epoch 00003: early stopping triggered.
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:  File "main.py", line 71, in <module>
[1,1]<stderr>:    main()
[1,1]<stderr>:  File "main.py", line 67, in main
[1,1]<stderr>:    trainer.fit(model, data_module)
[1,1]<stderr>:  File "/Users/taddair/repos/pytorch-lightning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
[1,1]<stderr>:    result = fn(self, *args, **kwargs)
[1,1]<stderr>:  File "/Users/taddair/repos/pytorch-lightning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1068, in fit
[1,1]<stderr>:    results = self.horovod_train(model)
[1,1]<stderr>:  File "/Users/taddair/repos/pytorch-lightning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 242, in horovod_train
[1,1]<stderr>:    result = self.run_pretrain_routine(model)
[1,1]<stderr>:  File "/Users/taddair/repos/pytorch-lightning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
[1,1]<stderr>:    self.train()
[1,1]<stderr>:  File "/Users/taddair/repos/pytorch-lightning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 394, in train
[1,1]<stderr>:    self.run_training_epoch()
[1,1]<stderr>:  File "/Users/taddair/repos/pytorch-lightning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 491, in run_training_epoch
[1,1]<stderr>:    batch_output = self.run_training_batch(batch, batch_idx)
[1,1]<stderr>:  File "/Users/taddair/repos/pytorch-lightning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 839, in run_training_batch
[1,1]<stderr>:    opt_closure_result = self.optimizer_closure(
[1,1]<stderr>:  File "/Users/taddair/repos/pytorch-lightning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 1076, in optimizer_closure
[1,1]<stderr>:    model_ref.backward(self, closure_loss, optimizer, opt_idx)
[1,1]<stderr>:  File "/Users/taddair/repos/pytorch-lightning/env/lib/python3.8/site-packages/pytorch_lightning/core/hooks.py", line 324, in backward
[1,1]<stderr>:    loss.backward()
[1,1]<stderr>:  File "/Users/taddair/repos/pytorch-lightning/env/lib/python3.8/site-packages/torch/tensor.py", line 185, in backward
[1,1]<stderr>:    torch.autograd.backward(self, gradient, retain_graph, create_graph)
[1,1]<stderr>:  File "/Users/taddair/repos/pytorch-lightning/env/lib/python3.8/site-packages/torch/autograd/__init__.py", line 125, in backward
[1,1]<stderr>:    Variable._execution_engine.run_backward(
[1,1]<stderr>:RuntimeError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.

Is this consistent with the error you're seeing?

@tgaddair
Contributor

tgaddair commented Oct 1, 2020

@undertherain can you try #3775 and see if it addresses your issue? With this change, I was able to get your test script to run to completion successfully. Note that this change requires Horovod >= v0.20.2.
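Not part of the patch, just a quick way to check that the installed Horovod is new enough:

import horovod
print(horovod.__version__)  # should print 0.20.2 or newer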

@edenlightning modified the milestones: 0.9.x, 1.0 Oct 4, 2020
@undertherain
Author

Yes it seems to work! Thanks!

@edenlightning removed the help wanted label Oct 5, 2020
@Borda modified the milestones: 1.0, 1.0.x Oct 13, 2020
@edenlightning modified the milestones: 1.0.x, 1.1, 1.0.3 Oct 14, 2020
@edenlightning added the 3rd party label Oct 19, 2020