
Wrong MPI messages ht.int16 #497

Closed
mtar opened this issue Mar 10, 2020 · 0 comments · Fixed by #499
Labels: bug (Something isn't working), MPI (Anything related to MPI communication)


mtar commented Mar 10, 2020

Description

MPI messages are corrupted when int16 tensors are communicated.

To Reproduce
Steps to reproduce the behavior:

  1. Which module/class/function is affected?
    int16 tensors, MPI communication functions

  2. What are the circumstances under which the bug appears?
    MPI communication with int16 arrays

  3. What is the exact error message/erroneous behavior?
    received messages are corrupted; with 4 elements, the run additionally aborts with a malloc error (see example)

Expected behavior
The received tensor matches the sent tensor.

Illustrative

import heat as ht

a = ht.array([[4, 5, 6, 7], [6, 7, 8, 9]], split=0, dtype=ht.int16)
if a.comm.rank == 0:
    print(a.comm.rank, a)
    a.comm.Send(a, dest=1)
elif a.comm.rank == 1:
    b = ht.empty((1, 4), a.dtype)
    a.comm.Recv(b, source=0)
    print(a.comm.rank, b)
$ mpirun -n 2 python bug.py 
0 tensor([[4, 5, 6, 7]], dtype=torch.int16)
1 tensor([[4, 0, 6, 7]], dtype=torch.int16)
malloc_consolidate(): invalid chunk size
[ZAM:10232] *** Process received signal ***
[ZAM:10232] Signal: Aborted (6)
[ZAM:10232] Signal code:  (-6)
[ZAM:10232] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7f516f8e3f20]
[ZAM:10232] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f516f8e3e97]
[ZAM:10232] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f516f8e5801]
[ZAM:10232] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x89897)[0x7f516f92e897]
[ZAM:10232] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x9090a)[0x7f516f93590a]
[ZAM:10232] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x90bae)[0x7f516f935bae]
[ZAM:10232] [ 6] /lib/x86_64-linux-gnu/libc.so.6(+0x947d8)[0x7f516f9397d8]
[ZAM:10232] [ 7] /lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x27d)[0x7f516f93c2ed]
[ZAM:10232] [ 8] python[0x5ac0b5]
[ZAM:10232] [ 9] python[0x56a894]
[ZAM:10232] [10] python[0x56bee3]
[ZAM:10232] [11] python(PyDict_SetItemString+0x153)[0x571633]
[ZAM:10232] [12] python(PyImport_Cleanup+0x76)[0x4f3256]
[ZAM:10232] [13] python(Py_FinalizeEx+0x5e)[0x6383ce]
[ZAM:10232] [14] python(Py_Main+0x395)[0x639435]
[ZAM:10232] [15] python(main+0xe0)[0x4b0f40]
[ZAM:10232] [16] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f516f8c6b97]
[ZAM:10232] [17] python(_start+0x2a)[0x5b2fda]
[ZAM:10232] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node ZAM exited on signal 6 (Aborted).
--------------------------------------------------------------------------
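Corruption of this kind is typically caused by a mismatch between the buffer's actual element width (2 bytes for int16) and the width assumed when the MPI datatype or element count is derived. The following is a NumPy-only sketch of that general failure class, not Heat's actual implementation; the variable names and the specific wrong byte count are hypothetical illustrations:

```python
import numpy as np

# The four-element int16 message from the report.
msg = np.array([4, 5, 6, 7], dtype=np.int16)

# Correct transfer: copy the full payload, 4 elements * 2 bytes = 8 bytes.
recv_ok = np.zeros(4, dtype=np.int16)
recv_ok.view(np.uint8)[:] = msg.view(np.uint8)

# Hypothetical bug: the byte count is computed for the wrong datatype
# width (here, as if elements were 1 byte wide), so only part of the
# payload is transferred and the tail of the receive buffer is never
# written -- the receiver sees a partially corrupted message.
recv_bad = np.zeros(4, dtype=np.int16)
nbytes_wrong = msg.size  # 4 bytes instead of the required 8
recv_bad.view(np.uint8)[:nbytes_wrong] = msg.view(np.uint8)[:nbytes_wrong]

print(recv_ok)   # [4 5 6 7]
print(recv_bad)  # only the first elements survive the transfer
```

A mismatch in the other direction (MPI writing more bytes than the receive buffer holds) would also explain the `malloc_consolidate(): invalid chunk size` abort, since the overrun clobbers the allocator's heap metadata.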

Version Info
master


@mtar mtar added the bug and MPI labels Mar 10, 2020
@mtar mtar mentioned this issue Mar 10, 2020
@mtar mtar self-assigned this Mar 10, 2020