ompi/v3.x.x bug since August 21: opal_datatype_pack.c:203 and opal_datatype_unpack.c:135 #6932
Comments
@ericch1 could you provide us with a test case?
OK, I will try to do this this week. It is not easy to extract an example from the code, but since it looks like it's happening at the beginning, I should be able to do it...
It is not as easy as I thought... I will try to see if valgrind gives us some clues...
OK, 7b09c15 is not finished yet, but all the tests that were failing now pass! So the bad merge really is limited to the modifications in f96994b... I was looking for Open MPI's validation tests. Are you still using Jenkins to test merge requests? I found this: but when I look into the details of a build, I see only a few tests, which last 13 seconds: http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/10301/console Am I looking in the right place? Is it possible, or necessary, to build with "--enable-debug" to catch my issue?
To confirm whether this is a problem with the merge or with the datatype engine itself, can you run your test with master?
OK, I will test master this morning. I also need to look at how to run your test suite with my configuration (particularly --enable-debug).
OK, I just tested 390e0bc without "--enable-debug" and the problem is still there, but I have less information on stderr:
[dockercentos7:17688] *** Process received signal ***
[dockercentos7:17688] Signal: Aborted (6)
[dockercentos7:17688] Signal code: (-6)
[dockercentos7:17688] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x7f14f92a35d0]
[dockercentos7:17688] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f14f82c8207]
[dockercentos7:17688] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f14f82c98f8]
[dockercentos7:17688] [ 3] /home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/BIB/bin/Test.BIBProblemeGD.opt(_Z15attacheDebuggerv+0x2c5e)[0x41a3ee]
[dockercentos7:17688] [ 4] /home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/GIREF/lib/libgiref_opt_Util.so(traitementSignal+0x2bd0)[0x7f150695a7e0]
[dockercentos7:17688] [ 5] /lib64/libc.so.6(+0x36280)[0x7f14f82c8280]
[dockercentos7:17688] [ 6] /lib64/libc.so.6(__sched_yield+0x7)[0x7f14f8374d47]
[dockercentos7:17688] [ 7] /opt/openmpi-4.x_debug/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xb5)[0x7f14f6f72dc5]
[dockercentos7:17688] [ 8] /opt/openmpi-4.x_debug/lib/libmpi.so.40(ompi_request_default_wait+0x1f0)[0x7f14f9e7cb40]
[dockercentos7:17688] [ 9] /opt/openmpi-4.x_debug/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0xc9)[0x7f14f9ecc789]
[dockercentos7:17688] [10] /opt/openmpi-4.x_debug/lib/libmpi.so.40(ompi_coll_base_allreduce_intra_recursivedoubling+0x296)[0x7f14f9ecd016]
[dockercentos7:17688] [11] /opt/openmpi-4.x_debug/lib/libmpi.so.40(PMPI_Allreduce+0x183)[0x7f14f9e8f8a3]
I will now run the tests against master at 8f32a59 to verify that it is OK, then against master at 94f26f5. Also, is https://github.com/open-mpi/mtt the suite I could run to test my local Open MPI installation more thoroughly? (I have never used it before.) Thanks,
Looking at your output it seems to me that the datatype representation could have been further optimized in order for the optimized description to look like
instead of
I created #6945 to address this optimization issue, but I don't think it fixes anything else. If you can give it a try, let me know the outcome. Also, do you have a reproducer for your test case?
OK, the SHA 94f26f5 is bad. Here is the stderr:
[dockercentos7:11303] opal_datatype_unpack.c:135 Pointer 0x8fd31a8 size 9 is outside [0x8fd0070,0x8fd31a1] for base ptr 0x8fd0070 count 525 and data
[dockercentos7:11303] Datatype 0x8ebac70[] size 17 align 8 id 0 length 4 used 3
true_lb 0 true_ub 17 (true_extent 17) lb 0 ub 24 (extent 24)
nbElems 3 loops 0 flags 114 (committed contiguous )-cC----GD--[---][---]
contain OPAL_INT8:* OPAL_BOOL:*
--C---P-D--[---][---] OPAL_INT8 count 1 disp 0x0 (0) blen 1 extent 8 (size 8)
--C---P-D--[---][---] OPAL_INT8 count 1 disp 0x8 (8) blen 1 extent 8 (size 8)
--C---P-D--[---][---] OPAL_BOOL count 1 disp 0x10 (16) blen 1 extent 1 (size 1)
-------G---[---][---] OPAL_LOOP_E prev 3 elements first elem displacement 0 size of data 17
Optimized description
-cC---P-DB-[---][---] OPAL_INT8 count 1 disp 0x0 (0) blen 1 extent 8 (size 8)
-cC---P-DB-[---][---] OPAL_UINT1 count 1 disp 0x8 (8) blen 9 extent 9 (size 9)
-------G---[---][---] OPAL_LOOP_E prev 2 elements first elem displacement 0 size of data 17
*** Error in `/home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/BIB/bin/Test.BIBProblemeGD.opt': corrupted size vs. prev_size: 0x0000000008fd31a0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7f5d4)[0x7f7399f095d4]
/lib64/libc.so.6(+0x816cb)[0x7f7399f0b6cb]
/home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/GIREF/lib/libgiref_opt_MaillageUtil.so(_ZN17PAScatterMultipleISt4pairIlS0_IlbEEED1Ev+0x3d)[0x7f73aa7befed]
/home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/GIREF/lib/libgiref_opt_MaillageUtil.so(_ZN11PAPartitionI52PATraitStockagePartitionConteneurDichotomiqueVecteurI10PtrPorteurI6SommetS2_EEE12lecturePriveER18PAPRFichierLecturel+0xe1d)[0x7f73aa7e166d]
/home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/GIREF/lib/libgiref_opt_MaillageUtil.so(_ZN27LectureConnectiviteMaillage15lectureMaillageER18PAPRFichierLectureR8Maillage+0x592)[0x7f73aa7b98b2]
/home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/GIREF/lib/libgiref_opt_Maillage.so(_ZN8Maillage24importeParalleleVersion1ERKSsRK17PAGroupeProcessusl+0xcf0)[0x7f73aac912e0]
/home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/GIREF/lib/libgiref_opt_Maillage.so(_ZN8Maillage16importeParalleleERKSsRK17PAGroupeProcessusl+0xd90)[0x7f73aac93610]
/home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/GIREF/lib/libgiref_opt_Contact.so(_ZN17CorpsAvecMaillage28lisDonneesDeBaseAvecMaillageERKSsRSsR20GestionFichierChampsRPS3_R24ListeEntitesGeometriquesRPS7_R9GeometrieRPSB_S1_RP17EntiteGeometriqueRSF_b+0x1f0)[0x7f73a6f10d60]
/home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/GIREF/lib/libgiref_opt_Contact.so(_ZN7CorpsEF16lisDonneesDeBaseERKSs+0xf0)[0x7f73a6f60070]
/home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/BIB/bin/Test.BIBProblemeGD.opt(_ZN17CollectionDeCorps10lisUnCorpsISsEE18SYEnveloppeMessageISsERKT_bbPP5Corps+0x8f2)[0x433e12]
/home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/BIB/bin/Test.BIBProblemeGD.opt(_ZN17CollectionDeCorps16lisDonneesDeBaseIN9__gnu_cxx17__normal_iteratorIPSsSt6vectorISsSaISsEEEEEE18SYEnveloppeMessageISsET_SA_bb+0xdf)[0x43434f]
/home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/BIB/bin/Test.BIBProblemeGD.opt[0x414a5b]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f7399eac3d5]
/home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/BIB/bin/Test.BIBProblemeGD.opt[0x4159af]
I will test ebe7ed6 tomorrow.
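The bounds in the opal_datatype_unpack.c:135 message can be checked by hand. The sketch below assumes the valid range is derived as base + true_lb up to base + (count - 1) * extent + true_ub; that formula is an assumption on my part, although it does reproduce the logged upper bound. The helper names `valid_range` and `is_outside` are hypothetical and are not Open MPI functions.

```python
def valid_range(base, count, extent, true_lb, true_ub):
    """Address range that `count` consecutive elements may touch (assumed formula)."""
    low = base + true_lb
    high = base + (count - 1) * extent + true_ub
    return low, high


def is_outside(ptr, size, low, high):
    """True if the access [ptr, ptr + size) leaves the range [low, high)."""
    return ptr < low or ptr + size > high


# Values copied from the error message and datatype dump in this thread.
base = 0x8fd0070
low, high = valid_range(base, count=525, extent=24, true_lb=0, true_ub=17)

print(hex(low), hex(high))                   # computed range
print(is_outside(0x8fd31a8, 9, low, high))   # the reported 9-byte access
print(0x8fd31a8 - base, 525 * 24)            # offset of the bad pointer vs. count * extent
```

Under this assumption the computed upper bound equals the logged 0x8fd31a1, and the offending pointer sits exactly count * extent bytes past the base, that is, at the start of an element beyond the 525 that were received.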
@gpaulsen @hppritcha I think you guys should evaluate the severity of this issue for the upcoming release.
I have good news! The patch in commit ebe7ed6 fixes all failing tests and all our other tests are 100% successful! :) Thanks, http://www.giref.ulaval.ca/~cmpgiref/ompi_4.x/2019.08.30.09h45m24s_config.log
The fix (and a tester to prevent it from happening in the future) is pending; it can be merged as soon as Jenkins is happy. The patch should be easy to backport to the stable branches (but I do not have time before next week).
This patch fixes the merge of contiguous elements into larger but more compact datatypes, and allows contiguous elements to have their blocklen increased instead of the count. The idea is to always maximize the blocklen, i.e. the contiguous part of the datatype.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit 41e6f55)
Addendum to original cherry-pick commit: This is a cherry-pick from master to the v3.1.x branch, which required some conflict resolution. This commit was shown to fix open-mpi#6932.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
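To illustrate the idea of maximizing the contiguous part, here is a minimal sketch that merges adjacent elements whenever one ends exactly where the next begins. It is a simplified model, not Open MPI's datatype optimizer: the `Elem` class and `merge_contiguous` function are hypothetical, and each element is reduced to a displacement and a byte length.

```python
from dataclasses import dataclass


@dataclass
class Elem:
    disp: int    # byte displacement from the start of the datatype
    nbytes: int  # contiguous bytes covered (blocklen * basic type size)


def merge_contiguous(elems):
    """Merge elements, in typemap order, when they are byte-adjacent."""
    merged = []
    for e in elems:
        if merged and merged[-1].disp + merged[-1].nbytes == e.disp:
            # The previous block ends where this one starts: grow the block.
            merged[-1].nbytes += e.nbytes
        else:
            merged.append(Elem(e.disp, e.nbytes))
    return merged


# Layout from the datatype dump in this thread:
# OPAL_INT8 at 0, OPAL_INT8 at 8, OPAL_BOOL at 16.
layout = [Elem(0, 8), Elem(8, 8), Elem(16, 1)]
print(merge_contiguous(layout))

# A layout with a hole between the elements is left as two blocks.
print(merge_contiguous([Elem(0, 8), Elem(16, 1)]))
```

In this model the three elements of the reported datatype collapse into a single block whose length equals the datatype size of 17 bytes, while elements separated by a gap stay separate.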
ompi/v4.0.x is 100% functional for us again this morning, thanks a lot!
This was fixed in v4.0.2 on the v4.0.x stream. @jsquyres, @bwbarrett are you still working with @bosilca on a similar PR for the v3.1.x or v3.0.x streams? If not, please close this issue.
It looks like it was too difficult to port back to v3.x -- the official guidance is that the fix is in the v4.0.x series. |
Hi,
EDIT: I corrected the SHAs mentioned in this first message, since it originally gave wrong information about which SHA was bad.
up to commit d3587f5, everything was fine, but
as of commit 390e0bc, we have some tests that are failing with errors like this:
Other example:
http://www.giref.ulaval.ca/~cmpgiref/ompi_4.x/2019.08.19.20h08m05s_config.log
http://www.giref.ulaval.ca/~cmpgiref/ompi_4.x/2019.08.19.20h08m05s_confdefs.h
http://www.giref.ulaval.ca/~cmpgiref/ompi_4.x/2019.08.19.20h08m05s_ompi_info_all.txt
All failing tests have more than 1 process.
They are all showing opal_datatype_pack.c:203 and opal_datatype_unpack.c:135 as above.
Note that we are compiling/testing with --enable-debug ...
I do not have an MWE yet, but I wanted to report this ASAP so that you are aware of it.
Thanks,
Eric