parallel build with gnu make 4.4 fails #3899

Closed
haampie opened this issue Feb 2, 2023 · 18 comments · Fixed by #3902

Comments

@haampie
Contributor

haampie commented Feb 2, 2023

I seem to get a build issue when using GNU make 4.4 with

'make' 'MAKE_NB_JOBS=0' 'ARCH=x86_64' 'TARGET=ZEN' 'USE_LOCKING=1' 'USE_OPENMP=1' 'USE_THREAD=1' 'libs' 'netlib' 'shared'

with -j16 at the top-level make (OpenBLAS is built with a recursive make; not sure if that is relevant), which fails after some time with

     29      Warning: Possible change of value in conversion from REAL(4) to INTEGER(4) at (1) [-Wconversion]
     30      sgebak.f:253:19:
     31      
     32        253 |                K = SCALE( I )
     33            |                   1
     34      Warning: Possible change of value in conversion from REAL(4) to INTEGER(4) at (1) [-Wconversion]
  >> 35      make[2]: *** No rule to make target '../libopenblas_zenp-r0.3.21.a', needed by '../libopenblas_zenp-r0.3.21.so'.  Stop.
     36      make[2]: Leaving directory '/tmp/user/spack-stage/spack-stage-openblas-0.3.21-rqvfhzdue2wzonslpabua3qssw7t6mu2/spack-src/exports'
  >> 37      make[1]: *** [Makefile:125: shared] Error 2
     38      make[1]: *** Waiting for unfinished jobs....
     39      /home/user/spack/lib/spack/env/gcc/gfortran -O2 -Wall -frecursive -fno-optimize-sibling-calls -m64 -fopenmp -fPIC -msse3 -msse4.1 -mavx -mavx2 -mavx2 -c -o sgebd2.o sgebd2.f
...

Is there anything that jumps out of the changelog that could explain this? https://lists.gnu.org/archive/html/info-gnu/2022-10/msg00008.html.

Maybe this?

Previously each target in a explicit grouped target rule was considered
individually: if the targets needed by the build were not out of date the
recipe was not run even if other targets in the group were out of date. Now
if any of the grouped targets are needed by the build, then if any of the
grouped targets are out of date the recipe is run and all targets in the
group are considered updated.

With GNU make 4.3 everything looks fine.

@martin-frbg
Collaborator

No idea, but the item you quoted reads more like the opposite to me. In your case it seems to tackle "shared" in parallel with the two targets it depends on (though judging from the compiler output, parts of the static library could already be on disk). Could there be something else involved that might garble contents or timestamps, like a distributed filesystem or a host filesystem mounted inside a virtual machine?
Any particular reason why you are calling individual targets instead of a full build?

@martin-frbg
Collaborator

Found a hint in the comments to a YouTube rant about gmake breaking glibc builds: the suggestion was to add --jobserver-style=pipe to the make options (supposedly from a discussion on the LFS mailing list, though apparently this did not work for the author of the video). Could you try it?
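
For anyone trying this suggestion from a wrapper script: the --jobserver-style option only exists in GNU make 4.4 and later, so it may be worth adding it conditionally. The following is a minimal sketch, not something from the OpenBLAS or spack sources. The helper name `jobserver_flag` is made up for illustration, and the version string is assumed to look like `4.4` or `4.4.1`.

```shell
#!/bin/sh
# Hypothetical helper: print --jobserver-style=pipe for GNU make >= 4.4,
# and print nothing for older versions that do not know the option.
jobserver_flag() {
    version=$1
    # Treat a bare major version such as "4" as "4.0".
    case $version in
        *.*) ;;
        *) version="$version.0" ;;
    esac
    major=${version%%.*}
    rest=${version#*.}
    minor=${rest%%.*}
    if [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 4 ]; }; then
        printf '%s\n' '--jobserver-style=pipe'
    fi
}
```

The output can then be spliced into the build command, for example `make $(jobserver_flag 4.4) -j16 libs netlib shared`, with the version taken from wherever the calling script already knows it.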

@martin-frbg
Collaborator

martin-frbg commented Feb 4, 2023

Not reproducible so far on openSUSE Tumbleweed with gmake 4.4. One caveat: it is a VM with only 10 threads assigned to it, so more and faster cores might change the picture. Still, this at least suggests that setting MAKE_NB_JOBS=10 might be an acceptable workaround if the problem is not caused by an underlying filesystem error.

@haampie
Contributor Author

haampie commented Feb 4, 2023

> Any particular reason why you are calling individual targets instead of a full build?

It's just to exclude the tests.

> Could there be something else involved that might garble contents or timestamps, like a distributed filesystem or a host filesystem mounted inside a virtual machine?

It was a regular desktop with a fast NVMe disk.

> suggestion was to add --jobserver-style=pipe to the make options

Aha, I actually picked gmake 4.4 because of its FIFO jobserver support, which is much more reliable than pipes.

I'll try to come up with a better reproducer once I have access to the relevant machine again. If it just works for you, maybe it is something machine-specific...

@martin-frbg
Collaborator

I'll try on a bigger system sometime next week.
What's bad about having the tests as part of the build? Granted, they should not fail on fairly common systems, but building and running them is not expected to take much time, and there is the potential benefit of catching a bug.

@brada4
Contributor

brada4 commented Feb 7, 2023

How many processors does the actual machine have?
MAKE_NB_JOBS=0 makes OpenBLAS override the default jobserver. It has to be negative to obey the -j99 flag.

@brada4
Contributor

brada4 commented Feb 7, 2023

@haampie please attach full build logs, as the spack build bug template suggests. If your guess about the missing file (../libopenblas_zenp-r0.3.21.a) were right, the problem should show up much earlier, when ar fails to create that file (or is the file there in the build dir?).

@brada4
Contributor

brada4 commented Feb 7, 2023

Works fine on Fedora Rawhide upgraded earlier today.
Your log suggests that ld is building the .so while the .a is still being assembled with the netlib part. Can you try three make commands in a row?

@brada4
Contributor

brada4 commented Feb 7, 2023

Got it. You pass -j`nproc` to make, and that runs 3 jobs in parallel, one for each target, regardless of the MAKE_NB_JOBS setting.
This changed from gmake 4.3. You need to run 3 commands, or patch the default target in the makefile to omit the tests.
@martin-frbg I have no clue how to work around this; setting dynamic to wait on the others still intermixes all 3 parameters.
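
The "run 3 commands" workaround could be scripted roughly as below. This is only a sketch under assumptions: the function name `build_in_order` and the `MAKE_BIN` and `JOBS` variables are hypothetical, and the goal names and MAKE_NB_JOBS=0 are simply the ones used in this thread.

```shell
#!/bin/sh
# Hypothetical helper: build each goal with its own make invocation, so that
# one goal has finished before the next one starts. Parallelism is kept
# inside each invocation via -j.
build_in_order() {
    # JOBS is an assumed override; otherwise fall back to nproc, then to 1.
    njobs="${JOBS:-$(nproc 2>/dev/null || echo 1)}"
    for target in "$@"; do
        # MAKE_BIN is an assumed override for the make binary to use.
        "${MAKE_BIN:-make}" MAKE_NB_JOBS=0 -j"$njobs" "$target" || return 1
    done
}
```

Called as `build_in_order libs netlib shared`, this stops at the first goal that fails instead of letting later goals run against a partially written archive.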

@brada4
Contributor

brada4 commented Feb 7, 2023

So the reproducer is:
upgrade gmake 4.3->4.4
make -j `nproc` <list of targets comprising the full .so file>
no longer works.
Nothing specific to OpenBLAS; any similar build command will be affected.

@brada4
Contributor

brada4 commented Feb 7, 2023

@haampie actually, most other spack packages are built by invoking make() multiple times,
e.g.
https://github.com/spack/spack/blob/2516ed181ad05c34a4eb0948ca4bc6bd567a99ba/var/spack/repos/builtin/packages/n2p2/package.py#L84

@haampie
Contributor Author

haampie commented Feb 8, 2023

> Nothing specific to OpenBLAS

The question is whether this is a regression in GNU make or not.

ELF files are broken/truncated, and if I repeatedly run

make -j `nproc` libs netlib shared

there are ... scary warnings like

ar: warning: ../../libopenblas_zenp-r0.3.21.a(dzsum.o) has a section extending past end of file

out of memory allocating 3760822486181093428 bytes after a total of 18446697842329902304 bytes

🤦‍♂️

I can try and file an issue upstream, to me make all and make [all's prereqs] should be equivalent...

martin-frbg added the "Bug in other software" label (Compiler, Virtual Machine, etc. bug affecting OpenBLAS) on Feb 8, 2023
@martin-frbg
Collaborator

martin-frbg commented Feb 8, 2023

Can you confirm that the default make or make all still works (EDIT: as an all-core parallel build) with 4.4?

@haampie
Contributor Author

haampie commented Feb 8, 2023

Yeah, make MAKE_NB_JOBS=0 -j16 --shuffle=random works

@haampie
Contributor Author

haampie commented Feb 8, 2023

Ah, if I do make MAKE_NB_JOBS=0 -j16 libs && make MAKE_NB_JOBS=0 -j16 shared in that order, it errors:

home/harmen/spack/lib/spack/env/gcc/gcc -O2 -DSMALL_MATRIX_OPT -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DSMP_SERVER -DNO_WARMUP -DMAX_CPU_NUMBER=32 -DMAX_PARALLEL_NUMBER=1 -DBUILD_SINGLE=1 -DBUILD_DOUBLE=1 -DBUILD_COMPLEX=1 -DBUILD_COMPLEX16=1 -DVERSION=\"0.3.21\" -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mavx2 -UASMNAME -UASMFNAME -UNAME -UCNAME -UCHAR_NAME -UCHAR_CNAME -DASMNAME= -DASMFNAME=_ -DNAME=_ -DCNAME= -DCHAR_NAME=\"_\" -DCHAR_CNAME=\"\" -DNO_AFFINITY -I..  -w -o linktest linktest.c ../libopenblas_zenp-r0.3.21.so -L/usr/lib/gcc/x86_64-linux-gnu/12 -L/usr/lib/gcc/x86_64-linux-gnu/12/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/12/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/12/../../..  -lgfortran -lm -lquadmath -lm -lc  && echo OK.
/usr/bin/ld: /tmp/cct3hvMp.o: in function `main':
linktest.c:(.text.startup+0x536): undefined reference to `iparam2stage_'
/usr/bin/ld: linktest.c:(.text.startup+0x53d): undefined reference to `ilaenv2stage_'
/usr/bin/ld: linktest.c:(.text.startup+0x591): undefined reference to `spotri_'
/usr/bin/ld: linktest.c:(.text.startup+0x5e5): undefined reference to `dpotri_'
/usr/bin/ld: linktest.c:(.text.startup+0x639): undefined reference to `cpotri_'

@martin-frbg
Collaborator

shared definitely depends on netlib when you do not disable building LAPACK.
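
A quick way to check whether a freshly built shared library actually contains the LAPACK symbols from the link error above is to filter the output of nm. This is a sketch only: the helper name `missing_symbols` is hypothetical, and it assumes plain `nm -D --defined-only` output with the symbol name in the last column.

```shell
#!/bin/sh
# Hypothetical helper: read `nm -D --defined-only <lib>` output on stdin and
# print every symbol given as an argument that is NOT defined in it.
missing_symbols() {
    # The symbol name is the last whitespace-separated field of each line.
    defined=$(awk '{print $NF}')
    for sym in "$@"; do
        printf '%s\n' "$defined" | grep -qxF "$sym" || printf '%s\n' "$sym"
    done
}
```

For the library in this thread that would be something like `nm -D --defined-only ../libopenblas_zenp-r0.3.21.so | missing_symbols spotri_ dpotri_ cpotri_`. An empty result means all listed symbols are present.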

@haampie
Contributor Author

haampie commented Feb 8, 2023

Even simpler

$ make clean 1>/dev/null 2>&1 && make --trace shared MAKE_NB_JOBS=0
/usr/bin/ld: warning: /tmp/ccRR3ifI.o: missing .note.GNU-stack section implies executable stack
/usr/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
Makefile:125: target 'shared' does not exist
make -C exports so
make[1]: Entering directory '/tmp/harmen/spack-stage/spack-stage-openblas-0.3.21-zwgrfwv5q7nn3icvqos6nuosxohpcyfk/spack-src/exports'
make[1]: *** No rule to make target '../libopenblas_zenp-r0.3.21.a', needed by '../libopenblas_zenp-r0.3.21.so'.  Stop.
make[1]: Leaving directory '/tmp/harmen/spack-stage/spack-stage-openblas-0.3.21-zwgrfwv5q7nn3icvqos6nuosxohpcyfk/spack-src/exports'
make: *** [Makefile:125: shared] Error 2

This also happens with make 4.3

@martin-frbg
Collaborator

Oh well. Does it work with an added dependency "shared : libs netlib" in Makefile line 129? (I wonder what having that for something that "could not happen" will break next.)
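
To illustrate what the proposed dependency changes, here is a toy reproduction in a scratch directory. It is not the OpenBLAS Makefile: the file names libtoy.a and libtoy.so and the recipes are made up, and only the goal names mirror the ones in this thread. With the explicit prerequisites, `make shared` on its own first builds the archive, in any make version and at any -j level.

```shell
#!/bin/sh
# Toy reproduction in a scratch directory; file names and recipes are made up.
workdir=$(mktemp -d)

# One-line "target: prereqs ; recipe" rules avoid tab-sensitive recipe lines.
cat > "$workdir/Makefile" <<'EOF'
.PHONY: libs netlib shared
libs: ; @sleep 1; echo core > libtoy.a
netlib: libs ; @echo lapack >> libtoy.a
# The explicit prerequisites are the point: without them, "make shared"
# alone has no way to know that the archive must exist first.
shared: libs netlib ; @cp libtoy.a libtoy.so
EOF

# Only run the build when a make binary is installed.
if command -v make >/dev/null 2>&1; then
    (cd "$workdir" && make -j4 shared)
    cat "$workdir/libtoy.so"
fi
```

Removing `libs netlib` from the `shared` line of the toy Makefile and running `make shared` after a clean reproduces the same kind of failure as in the trace above, because nothing tells make how to produce the archive.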

martin-frbg removed the "Bug in other software" label (Compiler, Virtual Machine, etc. bug affecting OpenBLAS) on Feb 8, 2023
haampie added a commit to haampie/spack that referenced this issue Feb 8, 2023
Fix a race in the makefile where the shared lib was built before the
object files were available.

See OpenMathLib/OpenBLAS#3899
haampie added a commit to haampie/spack that referenced this issue Feb 16, 2023
Fix a race in the makefile where the shared lib was built before the
object files were available.

See OpenMathLib/OpenBLAS#3899
alalazo pushed a commit to spack/spack that referenced this issue Feb 18, 2023
Fix a race in the makefile where the shared lib was built before the
object files were available.

See OpenMathLib/OpenBLAS#3899
koysean pushed a commit to koysean/spack that referenced this issue Feb 20, 2023
Fix a race in the makefile where the shared lib was built before the
object files were available.

See OpenMathLib/OpenBLAS#3899