
reduce_op breaks on empty local tensor for some reduce operations #369

Closed
TheSlimvReal opened this issue Sep 2, 2019 · 7 comments

@TheSlimvReal
Contributor

Description
Some of the torch reduce operations behave differently when passed an empty tensor. For example, torch.sum returns 0, but torch.max and torch.min throw an exception when given an empty tensor as argument.
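For illustration, here is a minimal stand-alone PyTorch snippet (not from the original report) showing the asymmetry:

>>> import torch
>>> empty = torch.empty(0)
>>> torch.sum(empty)   # sum has an identity element (0)
tensor(0.)
>>> torch.prod(empty)  # prod does as well (1)
tensor(1.)
>>> torch.max(empty)   # max/min have none, so the reduction fails
RuntimeError: invalid argument 1: cannot perform reduction function max on tensor with no elements because the operation does not have an identity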

To Reproduce
Steps to reproduce the behavior:
Run with 4 processes:

>>> import heat as ht
>>> a = ht.ones((3, 3), split=1)
>>> a.max()
RuntimeError: invalid argument 1: cannot perform reduction function max on tensor with no elements because the operation does not have an identity at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:189
  1. Which module/class/function is affected?
    All methods using the 'reduce_op' function
  2. What are the circumstances under which the bug appears?
    Some of the internally used torch functions do not work on empty local tensors.
  3. What is the exact error message/erroneous behavior?
    RuntimeError: invalid argument 1: cannot perform reduction function max on tensor with no elements because the operation does not have an identity at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:189

Expected behavior
The local functions should return a neutral value that does not affect the result (like 0 for sum). The problem is defining this neutral element for some of the other functions, and I cannot think of a general way to do it. The fact is that the MPI allreduce breaks when not every process provides values.
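One possible direction, sketched below with hypothetical names (NEUTRAL, local_reduce) rather than actual Heat code, is to map each reduction to a neutral element and substitute it for empty local chunks before the MPI reduction; for min/max the only candidates are +/- infinity, and operations like argmax have no obvious analogue at all:

import torch

# Hypothetical mapping from a torch reduction to its neutral element.
NEUTRAL = {
    torch.sum: 0.0,
    torch.prod: 1.0,
    torch.max: float("-inf"),
    torch.min: float("inf"),
}

def local_reduce(op, local_tensor):
    # Reduce the local chunk; if it is empty, fall back to the neutral element
    # so that the subsequent Allreduce still receives a value from every rank.
    if local_tensor.numel() == 0:
        return torch.tensor(NEUTRAL[op])
    return op(local_tensor)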

@TheSlimvReal
Contributor Author

@ClaudiaComito Do you have any idea what an elegant solution might look like?

@TheSlimvReal TheSlimvReal changed the title reduce op beaks on empty local tensor for some reduce operations reduce_op breaks on empty local tensor for some reduce operations Oct 2, 2019
@ClaudiaComito
Contributor

@TheSlimvReal thanks and sorry I missed this message. I'll look into this.

@ClaudiaComito ClaudiaComito self-assigned this Nov 25, 2019
@ClaudiaComito
Contributor

Hi again @TheSlimvReal. I could reproduce the error, although not 100% (my system simply gets stuck rather than throwing an exception; possibly I just haven't waited long enough).

After researching a bit, I'm not sure that we should do anything at all, apart from maybe raising a warning when chunking and an exception when calling reduce_op. At that point we already know that some nodes have no data and will run into trouble.

I haven't been able to find a way to exclude nodes from MPI collective operations. I've been playing around with the option of using comm.Exscan instead of Allreduce. But that would imply:

  • expanding the size of the input tensor in the split dimension to have non-empty partial tensors on every rank
  • filling the fake added tensors with bogus data (factories.empty? ones?)
  • (keeping track of which rank is the last one with actual data, let's call it rank k)
  • running comm.Exscan instead of comm.Allreduce.
  • broadcasting the result of comm.Exscan from rank k+1 to all ranks.

That would be the result of operation MPI.OP applied to all ranks from 0 to k, which is what we want. We could probably do this, but I'm not sure we should.
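A rough sketch of that idea with mpi4py (a hypothetical helper, not Heat code; it assumes every rank has already reduced its local chunk to a partial result of identical shape, with bogus data on the ranks beyond rank k, and that rank k+1 exists):

import numpy as np
from mpi4py import MPI

def reduce_up_to_rank_k(comm, partial, k, op=MPI.SUM):
    recv = np.zeros_like(partial)
    # Exclusive scan: afterwards rank i holds op(partial_0, ..., partial_{i-1}),
    # so rank k+1 holds exactly the reduction over ranks 0..k, i.e. the ranks with real data.
    comm.Exscan(partial, recv, op=op)
    # Broadcast that value from rank k+1 to all ranks.
    comm.Bcast(recv, root=k + 1)
    return recv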

Does anybody among the MPI experts have a good solution? @d1saster @coquelin77 @Cdebus ?

@coquelin77
Member

Yeah, I've noticed some buggy operations where there is no data on a process. However, it is my belief that this is something that, while not good, is okay. Since we are targeting large datasets for our analyses, I believe we can disregard these errors on a global scale and instead solve these problems within the functions which might create empty tensors.

@ClaudiaComito
Contributor

Hi @coquelin77, I agree, although in practice the only way of dealing with it would be for the factories to throw an exception when the distribution leaves empty nodes. Or can anybody think of another way?

I'm not worried about anybody wanting to run calculations on a 3x3 tensor on 4 nodes. I worry more about the cases where you're running your job on n nodes and the tensor size, after convolution, maxpooling and whatnot, ends up being n-1 along the split axis. But I'm not sure how frequent this is going to be.
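For what it's worth, a minimal sketch of the kind of check the factories (or the chunking logic) could perform; the function name and the warning text below are illustrative, not Heat's actual API:

import warnings

def warn_on_empty_chunks(gshape, split, nprocs):
    # If the split dimension is shorter than the number of processes,
    # at least one process ends up with an empty local chunk.
    if split is not None and gshape[split] < nprocs:
        warnings.warn(
            "splitting dimension {} of size {} across {} processes leaves some "
            "processes with empty chunks; reductions without an identity "
            "(min, max, ...) will fail on them".format(split, gshape[split], nprocs)
        )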

@Markus-Goetz
Member

Is this resolved by the introduction of the neutral element?

@ClaudiaComito
Contributor

Yes I think so! Thanks
