
reduce_op breaks on empty local tensor for some reduce operations #369

Closed
TheSlimvReal opened this issue Sep 2, 2019 · 7 comments

@TheSlimvReal
Contributor

Description
Some of the torch reduce operations behave differently when passed an empty tensor. For example, torch.sum returns 0, but torch.max and torch.min throw an exception when given an empty tensor as argument.
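For illustration, here is a minimal stand-alone PyTorch snippet (not from the original report) showing the asymmetry:

>>> import torch
>>> empty = torch.empty(0)
>>> torch.sum(empty)   # sum has an identity element (0)
tensor(0.)
>>> torch.prod(empty)  # prod does as well (1)
tensor(1.)
>>> torch.max(empty)   # max/min have none, so the reduction fails
RuntimeError: invalid argument 1: cannot perform reduction function max on tensor with no elements because the operation does not have an identity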

To Reproduce
Steps to reproduce the behavior:
Run with 4 processes:

>>> import heat as ht
>>> a = ht.ones((3, 3), split=1)
>>> a.max()
RuntimeError: invalid argument 1: cannot perform reduction function max on tensor with no elements because the operation does not have an identity at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:189
  1. Which module/class/function is affected?
    All methods using the 'reduce_op' function
  2. What are the circumstances under which the bug appears?
    Some of the internally used torch functions do not work on empty local tensors.
  3. What is the exact error message/erroneous behavior?
    RuntimeError: invalid argument 1: cannot perform reduction function max on tensor with no elements because the operation does not have an identity at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:189

Expected behavior
The local functions should return a neutral value that does not affect the result (like 0 for sum). The problem is defining this neutral element for some of the other functions, and I cannot think of a general way to do it. The fact is that the MPI allreduce breaks when not every process provides values.
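One possible direction, sketched below with hypothetical names (NEUTRAL, local_reduce) rather than actual Heat code, is to map each reduction to a neutral element and substitute it for empty local chunks before the MPI reduction; for min/max the only candidates are +/- infinity, and operations like argmax have no obvious analogue at all:

import torch

# Hypothetical mapping from a torch reduction to its neutral element.
NEUTRAL = {
    torch.sum: 0.0,
    torch.prod: 1.0,
    torch.max: float("-inf"),
    torch.min: float("inf"),
}

def local_reduce(op, local_tensor):
    # Reduce the local chunk; if it is empty, fall back to the neutral element
    # so that the subsequent Allreduce still receives a value from every rank.
    if local_tensor.numel() == 0:
        return torch.tensor(NEUTRAL[op])
    return op(local_tensor)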

@TheSlimvReal
Contributor Author

@ClaudiaComito Do you have any idea what an elegant solution might look like?

@TheSlimvReal TheSlimvReal changed the title reduce op beaks on empty local tensor for some reduce operations reduce_op breaks on empty local tensor for some reduce operations Oct 2, 2019
@ClaudiaComito
Contributor

@TheSlimvReal thanks and sorry I missed this message. I'll look into this.

@ClaudiaComito ClaudiaComito self-assigned this Nov 25, 2019
@ClaudiaComito
Contributor

Hi again @TheSlimvReal. I could reproduce the error, although not 100% (my system simply gets stuck rather than throwing an exception; possibly I just haven't waited long enough).

After researching a bit, I'm not sure that we should do anything at all, apart from maybe raising a warning when chunking and an exception when calling reduce_op. At that point we already know that some nodes have no data and will run into trouble.

I haven't been able to find a way to exclude nodes from MPI collective operations. I've been playing around with the option of using comm.Exscan instead of Allreduce. But that would imply:

  • expanding the size of the input tensor in the split dimension to have non-empty partial tensors on every rank
  • filling the fake added tensors with bogus data (factories.empty? ones?)
  • (keeping track of which rank is the last one with actual data, let's call it rank k)
  • running comm.Exscan instead of comm.Allreduce.
  • broadcasting the result of comm.Exscan from rank k+1 to all ranks.

That would be the result of operation MPI.OP applied to all ranks from 0 to k, which is what we want. We could probably do this, but I'm not sure we should.
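A rough sketch of that idea with mpi4py (a hypothetical helper, not Heat code; it assumes every rank has already reduced its local chunk to a partial result of identical shape, with bogus data on the ranks beyond rank k, and that rank k+1 exists):

import numpy as np
from mpi4py import MPI

def reduce_up_to_rank_k(comm, partial, k, op=MPI.SUM):
    recv = np.zeros_like(partial)
    # Exclusive scan: afterwards rank i holds op(partial_0, ..., partial_{i-1}),
    # so rank k+1 holds exactly the reduction over ranks 0..k, i.e. the ranks with real data.
    comm.Exscan(partial, recv, op=op)
    # Broadcast that value from rank k+1 to all ranks.
    comm.Bcast(recv, root=k + 1)
    return recv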

Does anybody among the MPI experts have a good solution? @d1saster @coquelin77 @Cdebus ?

@coquelin77
Member

Yeah, I've noticed some buggy operations where there is no data on a process. However, it is my belief that this is something that, while not good, is okay. Since we are targeting large datasets for our analyses, I believe we can disregard these errors on a global scale and instead solve these problems within the functions which might create empty tensors.

@ClaudiaComito
Contributor

Hi @coquelin77, I agree, although in practice the only way of dealing with it would be for the factories to throw an exception when the distribution leaves empty nodes. Or can anybody think of another way?

I'm not worried about anybody wanting to run calculations on a 3x3 tensor on 4 nodes. I worry more about the cases where you're running your job on n nodes and the tensor size, after convolution, maxpooling and whatnot, ends up being n-1 along the split axis. But I'm not sure how frequent this is going to be.
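For what it's worth, a minimal sketch of the kind of check the factories (or the chunking logic) could perform; the function name and the warning text below are illustrative, not Heat's actual API:

import warnings

def warn_on_empty_chunks(gshape, split, nprocs):
    # If the split dimension is shorter than the number of processes,
    # at least one process ends up with an empty local chunk.
    if split is not None and gshape[split] < nprocs:
        warnings.warn(
            "splitting dimension {} of size {} across {} processes leaves some "
            "processes with empty chunks; reductions without an identity "
            "(min, max, ...) will fail on them".format(split, gshape[split], nprocs)
        )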

@Markus-Goetz
Member

Is this resolved by the introduction of the neutral element?

@ClaudiaComito
Contributor

Yes I think so! Thanks
