{2023.06}[2023a] PyTorch-Bundle v2.1.2 #585

Open
wants to merge 13 commits into
base: 2023.06-software.eessi.io

casparvl
Collaborator

@casparvl casparvl commented May 23, 2024

15 out of 137 required modules missing:

* parameterized/0.9.0-GCCcore-12.3.0 (parameterized-0.9.0-GCCcore-12.3.0.eb)
* tqdm/4.66.1-GCCcore-12.3.0 (tqdm-4.66.1-GCCcore-12.3.0.eb)
* LLVM/14.0.6-GCCcore-12.3.0-llvmlite (LLVM-14.0.6-GCCcore-12.3.0-llvmlite.eb)
* Scalene/1.5.26-GCCcore-12.3.0 (Scalene-1.5.26-GCCcore-12.3.0.eb)
* gperftools/2.12-GCCcore-12.3.0 (gperftools-2.12-GCCcore-12.3.0.eb)
* SentencePiece/0.2.0-GCC-12.3.0 (SentencePiece-0.2.0-GCC-12.3.0.eb)
* tensorboard/2.15.1-gfbf-2023a (tensorboard-2.15.1-gfbf-2023a.eb)
* imageio/2.33.1-gfbf-2023a (imageio-2.33.1-gfbf-2023a.eb)
* libmad/0.15.1b-GCCcore-12.3.0 (libmad-0.15.1b-GCCcore-12.3.0.eb)
* SoX/14.4.2-GCCcore-12.3.0 (SoX-14.4.2-GCCcore-12.3.0.eb)
* NLTK/3.8.1-foss-2023a (NLTK-3.8.1-foss-2023a.eb)
* numba/0.58.1-foss-2023a (numba-0.58.1-foss-2023a.eb)
* scikit-image/0.22.0-foss-2023a (scikit-image-0.22.0-foss-2023a.eb)
* librosa/0.10.1-foss-2023a (librosa-0.10.1-foss-2023a.eb)
* PyTorch-bundle/2.1.2-foss-2023a (PyTorch-bundle-2.1.2-foss-2023a.eb)

eessi-bot bot commented May 23, 2024

Instance eessi-bot-mc-aws is configured to build:

  • arch x86_64/generic for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/generic for repo eessi-hpc.org-2023.06-software
  • arch x86_64/generic for repo eessi.io-2023.06-compat
  • arch x86_64/generic for repo eessi.io-2023.06-software
  • arch x86_64/intel/haswell for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/intel/haswell for repo eessi-hpc.org-2023.06-software
  • arch x86_64/intel/haswell for repo eessi.io-2023.06-compat
  • arch x86_64/intel/haswell for repo eessi.io-2023.06-software
  • arch x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-compat
  • arch x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-software
  • arch x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen2 for repo eessi.io-2023.06-compat
  • arch x86_64/amd/zen2 for repo eessi.io-2023.06-software
  • arch x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen3 for repo eessi.io-2023.06-compat
  • arch x86_64/amd/zen3 for repo eessi.io-2023.06-software
  • arch aarch64/generic for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/generic for repo eessi-hpc.org-2023.06-software
  • arch aarch64/generic for repo eessi.io-2023.06-compat
  • arch aarch64/generic for repo eessi.io-2023.06-software
  • arch aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-software
  • arch aarch64/neoverse_n1 for repo eessi.io-2023.06-compat
  • arch aarch64/neoverse_n1 for repo eessi.io-2023.06-software
  • arch aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-software
  • arch aarch64/neoverse_v1 for repo eessi.io-2023.06-compat
  • arch aarch64/neoverse_v1 for repo eessi.io-2023.06-software

eessi-bot bot commented May 23, 2024

Instance eessi-bot-mc-azure is configured to build:

  • arch x86_64/amd/zen4 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen4 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen4 for repo eessi.io-2023.06-compat
  • arch x86_64/amd/zen4 for repo eessi.io-2023.06-software

@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3

eessi-bot bot commented May 23, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 resulted in:

eessi-bot bot commented May 23, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • account casparvl has NO permission to send commands to the bot

eessi-bot bot commented May 23, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen3 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_585/11283

date job status comment
May 23 09:23:40 UTC 2024 submitted job id 11283 awaits release by job manager
May 23 09:24:02 UTC 2024 released job awaits launch by Slurm scheduler
May 23 09:28:04 UTC 2024 running job 11283 is running
May 23 09:33:17 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-11283.out
❌ found message matching ERROR:
✅ no message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
May 23 09:33:17 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-11283.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Collaborator Author

== No easyconfigs left to be built.
ERROR: Missing dependencies: SentencePiece/0.2.0-foss-2023a, SoX/14.4.2-foss-2023a (no easyconfig file or existing module found)
== Build succeeded for 0 out of 0
  >> download succeeded: https://github.com/easybuilders/easybuild-easyconfigs/archive/7124863ed588066e5a988b4073d91381497a7f64.tar.gz
  >> running command:
        [started at: 2024-05-23 09:28:34]
        [working dir: /tmp/eb-dlj1ws2x/eb-9tn8fu3_/tmpp3me5uio/easybuilders]
        [output logged in /tmp/eb-dlj1ws2x/eb-9tn8fu3_/easybuild-run_cmd-t6inmlw4.log]
        tar xzf /tmp/eb-dlj1ws2x/eb-9tn8fu3_/tmpp3me5uio/easybuilders/7124863ed588066e5a988b4073d91381497a7f64.tar.gz
  >> command completed: exit 0, ran in 00h00m01s
== found valid index for /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/EasyBuild/4.9.1/easybuild/easyconfigs, so using it...
== Running parse hook for PyTorch-bundle-2.1.2-foss-2023a.eb...
== Running parse hook for foss-2023a.eb...
== resolving dependencies ...
== Running parse hook for parameterized-0.9.0-GCCcore-12.3.0.eb...
== Running parse hook for GCCcore-12.3.0.eb...
== Running parse hook for GCCcore-12.3.0.eb...
== Running parse hook for scikit-image-0.22.0-foss-2023a.eb...
== Running parse hook for librosa-0.10.1-foss-2023a.eb...
== Running parse hook for imageio-2.33.1-gfbf-2023a.eb...
== Running parse hook for gfbf-2023a.eb...
== Running parse hook for gfbf-2023a.eb...
== Running parse hook for GCC-12.3.0.eb...
== Running parse hook for FlexiBLAS-3.3.1-GCC-12.3.0.eb...
== Running parse hook for GCC-12.3.0.eb...
== Running parse hook for FFTW-3.3.10-GCC-12.3.0.eb...
== Running parse hook for NLTK-3.8.1-foss-2023a.eb...
== Running parse hook for numba-0.58.1-foss-2023a.eb...
== Running parse hook for Scalene-1.5.26-GCCcore-12.3.0.eb...
== Running parse hook for tqdm-4.66.1-GCCcore-12.3.0.eb...
== Running parse hook for LLVM-14.0.6-GCCcore-12.3.0-llvmlite.eb...
== Running parse hook for tensorboard-2.15.1-gfbf-2023a.eb...

I guess that with --from-pr we got SentencePiece and SoX correctly since they were already in develop, but with --from-commit we don't? Should I combine multiple --from-commit options, one for each of those (i.e. look up the commit that provided the required SentencePiece, etc.)?

@bedroge
Collaborator

bedroge commented May 23, 2024

> I guess that with --from-pr we got SentencePiece and SoX correctly since they were already in develop, but with --from-commit we don't? Should I combine multiple --from-commit options, one for each of those (i.e. look up the commit that provided the required SentencePiece, etc.)?

I (and @trz42 and @ocaisa) also saw issues with using --from-commit; see for instance #558 (comment).

@bedroge
Collaborator

bedroge commented May 23, 2024

Could you try using the merge commit (see bottom of the PR: 04ccd901a613631b00ccbe504d6d66d6a6c2febb) and check if that does work?

@casparvl
Collaborator Author

I tried manually

eb -D PyTorch-bundle-2.1.2-foss-2023a-CUDA-12.1.1.eb --from-commit 04ccd901a613631b00ccbe504d6d66d6a6c2febb

But that still shows missing EasyConfigs.

@bedroge
Collaborator

bedroge commented May 23, 2024

> I tried manually
>
> eb -D PyTorch-bundle-2.1.2-foss-2023a-CUDA-12.1.1.eb --from-commit 04ccd901a613631b00ccbe504d6d66d6a6c2febb
>
> But that still shows missing EasyConfigs.

Guess we need to stick to --from-pr then until we find a solution for this...

@casparvl
Collaborator Author

I was being stupid. I made a mistake in what I ran manually: that's with CUDA. That's not included in that PR/commit for sure... :P However,

eb -D PyTorch-bundle-2.1.2-foss-2023a.eb --from-commit 04ccd901a613631b00ccbe504d6d66d6a6c2febb

shows the same missing easyconfigs. I've switched to --from-pr for now. I'll try to create an upstream issue on EasyBuild later (if there isn't any yet).

@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3

eessi-bot bot commented May 23, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 resulted in:

eessi-bot bot commented May 23, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • account casparvl has NO permission to send commands to the bot

eessi-bot bot commented May 23, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen3 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_585/11288

date job status comment
May 23 11:50:20 UTC 2024 submitted job id 11288 awaits release by job manager
May 23 11:50:42 UTC 2024 released job awaits launch by Slurm scheduler
May 23 11:55:44 UTC 2024 running job 11288 is running
May 23 12:23:21 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-11288.out
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen3-1716466678.tar.gz
size: 162 MiB (170601270 bytes)
entries: 6321
modules under 2023.06/software/linux/x86_64/amd/zen3/modules/all
imageio/2.33.1-gfbf-2023a.lua
LLVM/14.0.6-GCCcore-12.3.0-llvmlite.lua
NLTK/3.8.1-foss-2023a.lua
numba/0.58.1-foss-2023a.lua
parameterized/0.9.0-GCCcore-12.3.0.lua
Scalene/1.5.26-GCCcore-12.3.0.lua
scikit-image/0.22.0-foss-2023a.lua
tqdm/4.66.1-GCCcore-12.3.0.lua
software under 2023.06/software/linux/x86_64/amd/zen3/software
imageio/2.33.1-gfbf-2023a
LLVM/14.0.6-GCCcore-12.3.0-llvmlite
NLTK/3.8.1-foss-2023a
numba/0.58.1-foss-2023a
parameterized/0.9.0-GCCcore-12.3.0
Scalene/1.5.26-GCCcore-12.3.0
scikit-image/0.22.0-foss-2023a
tqdm/4.66.1-GCCcore-12.3.0
other under 2023.06/software/linux/x86_64/amd/zen3
no other files in tarball
May 23 12:23:21 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-11288.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Collaborator Author

This is the actual failure:

== 2024-05-23 12:17:16,011 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/tools/build_log.py:111 in caller_info): Sanity check failed: extensions sanity check failed for 1 extensions: soundfile
failing sanity check for 'soundfile' extension: command "python -c "import soundfile"" failed; output:
Traceback (most recent call last):
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/librosa/0.10.1-foss-2023a/lib/python3.11/site-packages/soundfile.py", line 161, in <module>
    import _soundfile_data  # ImportError if this doesn't exist
    ^^^^^^^^^^^^^^^^^^^^^^
ModuleNotFoundError: No module named '_soundfile_data'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/librosa/0.10.1-foss-2023a/lib/python3.11/site-packages/soundfile.py", line 171, in <module>
    _snd = _ffi.dlopen(_libname)
           ^^^^^^^^^^^^^^^^^^^^^
OSError: cannot load library 'libsndfile.so.1': libsndfile.so.1: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/librosa/0.10.1-foss-2023a/lib/python3.11/site-packages/soundfile.py", line 192, in <module>
    _snd = _ffi.dlopen(_explicit_libname)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: cannot load library 'libsndfile.so': libsndfile.so: cannot open shared object file: No such file or directory,  (at easybuild/framework/easyblock.py:3669 in _sanity_check_step)

I guess this should be provided by the module libsndfile/1.2.2-GCCcore-12.3.0, but I'm not sure what paths get searched by this dlopen call. I think it searches LD_LIBRARY_PATH, which we don't set in EESSI.

I guess this is a pretty fundamental question: how do we make dlopen calls successfully find libs from the EESSI software prefix?
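
As a rough illustration of the question above: as far as the dlopen(3) manual describes it, a bare name (no slash) is looked up via the run path of the calling object, LD_LIBRARY_PATH, the loader cache and the default system directories, whereas a name containing a slash is opened as a path. The sketch below only mimics the LD_LIBRARY_PATH step, to show why an unset variable yields no match. The helper name `search_ld_library_path` is hypothetical and this is not what dlopen itself runs.

```python
import os


def search_ld_library_path(soname, env=None):
    """Return the first existing <dir>/<soname> from LD_LIBRARY_PATH, or None.

    Only one step of the loader's search is mimicked here; empty entries
    in the variable are skipped for simplicity.
    """
    env = os.environ if env is None else env
    for directory in env.get("LD_LIBRARY_PATH", "").split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, soname)
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # An environment without LD_LIBRARY_PATH gives no match: prints None
    print(search_ld_library_path("libsndfile.so.1", env={}))
```

If a lookup like this returned an absolute path, passing that path to the dlopen call would sidestep the loader's search entirely.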

@ocaisa
Member

ocaisa commented May 23, 2024

See #192 , the Alliance have a solution for this

@casparvl
Collaborator Author

Spot on, it is indeed the issue of ctypes.util's find_library only returning the filename, not the full path. Or at least: I see that it is using find_library here to get the _libname, which is then used as the dlopen argument. I.e. I expect that if find_library had returned the full path, the dlopen call would have succeeded.

The downside is that the Alliance's solution looks quite involved... The upside is we can probably use their shadowing lib from https://github.com/ComputeCanada/custom_ctypes/tree/main/lib. What I don't fully understand is the sitecustomize and ebpythonprefixes stuff they do. Also, they seem to make a separate module out of it, and I'm not entirely sure why (do they only load it when they need to?).

I guess my main consideration would be if we shouldn't just always have this patched find_library function in place. In that case, a simple patch to the installation that normally contains ctypes (I guess that's in the standard Python installation?) would then be enough...
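
A minimal sketch of the "always patched" idea, not the Alliance's implementation: wrap ctypes.util.find_library so that a bare soname is promoted to an absolute path when it exists in a known list of directories. The function name `install_full_path_find_library` and its `lib_dirs` argument are hypothetical names introduced here for illustration.

```python
import ctypes.util
import os


def install_full_path_find_library(lib_dirs):
    """Replace ctypes.util.find_library with a wrapper; return the original."""
    original = ctypes.util.find_library

    def find_library(name):
        found = original(name)
        # Nothing found, or already a full path: keep the stock result.
        if found is None or os.path.isabs(found):
            return found
        for directory in lib_dirs:
            candidate = os.path.join(directory, found)
            if os.path.isfile(candidate):
                return candidate
        return found  # fall back to the bare name

    ctypes.util.find_library = find_library
    return original
```

Code that ran `from ctypes.util import find_library` before the wrapper was installed keeps its reference to the old function, which is one reason a patch inside the Python installation itself may be the more robust route.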

@ocaisa
Member

ocaisa commented May 23, 2024

I was also thinking that maybe a patch on ctypes is enough, I don't fully understand all the other stuff going on with them

@trz42
Collaborator

trz42 commented May 30, 2024

The changes they apply to ctypes are quite small. See below for Python/3.11.3. Maybe we could apply these changes "in-place" in a build container to test if they solve the issue?

diff -u /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/ctypes/util.py custom_ctypes/lib/python3.11/site-packages/ctypes/util.py
--- /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/ctypes/util.py      2024-04-30 16:38:09.000000000 +0200
+++ custom_ctypes/lib/python3.11/site-packages/ctypes/util.py   2024-05-30 16:17:44.000000000 +0200
@@ -326,7 +326,10 @@

         def find_library(name):
             # See issue #9998
+            lib = _findLib_gcc(name)
+            # return absolute path
             return _findSoname_ldconfig(name) or \
+                    os.path.join(os.path.dirname(lib), _get_soname(lib)) or \
                    _get_soname(_findLib_gcc(name)) or _get_soname(_findLib_ld(name))

 ################################################################

@trz42
Collaborator

trz42 commented May 31, 2024

I tried to replace the util.py globally (for all installations in NorESSI#387), but that already leads to a failure when building/installing scikit-image (the third package). See below for details. When I don't use the modified util.py, it fails with the same error @casparvl hit when building librosa.

    File "/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/amd/zen2/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/ctypes/util.py", line 332, in find_library
      os.path.join(os.path.dirname(lib), _get_soname(lib)) or \
                   ^^^^^^^^^^^^^^^^^^^^
    File "<frozen posixpath>", line 152, in dirname
  TypeError: expected str, bytes or os.PathLike object, not NoneType
  error: subprocess-exited-with-error

Will try to use that modified file only when building/using librosa.
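
The traceback above is consistent with `_findLib_gcc(name)` returning `None` for a library it cannot locate, so that `os.path.dirname(lib)` is evaluated on `None`. A hedged sketch of a guard follows. `absolute_soname` is a hypothetical helper, not part of ctypes; it takes the two values that the patched expression combines.

```python
import os


def absolute_soname(lib_path, soname):
    """Join the directory of lib_path with soname, or return None.

    Returning None instead of raising lets an `or` chain fall through
    to the next lookup strategy.
    """
    if not lib_path or not soname:
        return None
    return os.path.join(os.path.dirname(lib_path), soname)
```

In the patched `find_library`, a call like this could stand in for the `os.path.join(os.path.dirname(lib), _get_soname(lib))` term, so that a failed gcc lookup no longer aborts the whole chain.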

@trz42
Collaborator

trz42 commented Jun 4, 2024

I've worked out a fix for the import soundfile issue. See NorESSI#391

If it works out there, I'll test it with PyTorch-bundle. We can discuss how we should employ this fix (maybe it's better to ship the custom ctypes with EESSI, but for lack of a better idea where to put it, the above PR puts it under host_injections).

@trz42
Collaborator

trz42 commented Jun 6, 2024

I updated NorESSI#387 with the fixes in NorESSI#391 to work around the failing sanity check (python -c 'import soundfile'). PyTorch (with CUDA) builds for x86_64/{generic,intel/skylake_avx512,amd/zen2}. It fails for aarch64/generic and x86_64/intel/broadwell with a different issue. It could be worth applying the fixes also here and see which builds work (and which don't).

@casparvl
Collaborator Author

casparvl commented Jun 10, 2024

@trz42 I remember you said in a meeting that simply patching ctypes caused issues in other packages. I think the idea was then to pick up a 'patched' ctypes only for a specific phase of the build (the test step? I don't fully remember...). However, it was also brought up in that meeting that this fix would make the build pass, but users would still run into it at runtime, right?

I was thinking: what if we patch ctypes to add a different API call? I.e. a find_library with an extra argument full_path (which defaults to False, i.e. the default behaviour). And then, we patch librosa to call find_library(..., full_path=True). That way, you only get the full path back if you intentionally patch an application that depends on this find_library call. That should have no unintended fallout (because the default function call retains its prior behaviour of only returning the library name, not the full library path), while giving us an easy way to fix future similar issues (simply patch the function calls to find_library to add the full_path=True argument). It would also mean it is solved for these packages at runtime as well (we simply patched the package).

Now, this would be super annoying if there are packages that do a lot of find_library calls, since it means a lot of patching. But I assume that should be pretty limited (I mean... how many external libraries can a single package use, right...? Or did I now jinx it :P)
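
A sketch of the opt-in variant proposed above, assuming the directories to search are known in advance. Stock ctypes has no `full_path` parameter; the wrapper, the `FULL_PATH_DIRS` list and the `_finder` argument are all hypothetical and only illustrate the shape of the idea.

```python
import ctypes.util
import os

# Hypothetical: directories in which a bare soname may be resolved.
FULL_PATH_DIRS = []


def find_library(name, full_path=False, _finder=ctypes.util.find_library):
    """Like ctypes.util.find_library, but can return an absolute path on request."""
    found = _finder(name)
    if not full_path or found is None or os.path.isabs(found):
        return found  # the default call keeps the stock result
    for directory in FULL_PATH_DIRS:
        candidate = os.path.join(directory, found)
        if os.path.isfile(candidate):
            return candidate
    return found  # nothing matched: fall back to the bare name
```

A patched package would then call `find_library("sndfile", full_path=True)`, while every unpatched caller sees unchanged behaviour.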

eessi-bot bot commented Aug 22, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build arch:aarch64/generic repo:eessi.io-2023.06-software from casparvl

    • expanded format: build architecture:aarch64/generic repository:eessi.io-2023.06-software
  • handling command build architecture:aarch64/generic repository:eessi.io-2023.06-software resulted in:

Updates by the bot instance boegel-bot-deucalion (click for details)
  • account casparvl has NO permission to send commands to the bot

eessi-bot bot commented Aug 22, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build arch:aarch64/generic repo:eessi.io-2023.06-software from casparvl

    • expanded format: build architecture:aarch64/generic repository:eessi.io-2023.06-software
  • handling command build architecture:aarch64/generic repository:eessi.io-2023.06-software resulted in:

    • no jobs were submitted

eessi-bot bot commented Aug 22, 2024

New job on instance eessi-bot-mc-aws for architecture aarch64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_585/16826

  • fails two sanity checks ... see part of the log below
== 2024-08-22 14:32:22,373 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/tools/build_log.py:111 in caller_info): Sanity check failed: sanity check command spm_train --help | grep accept_language exited with code 1 (output: /bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.36' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.35' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libgcc_s.so.1)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libgcc_s.so.1)
)
sanity check command python -c 'import sentencepiece' exited with code 1 (output: /bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.36' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.35' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libgcc_s.so.1)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libgcc_s.so.1)
) (at easybuild/framework/easyblock.py:3663 in _sanity_check_step)
date job status comment
Aug 22 14:11:24 UTC 2024 submitted job id 16826 awaits release by job manager
Aug 22 14:11:37 UTC 2024 released job awaits launch by Slurm scheduler
Aug 22 14:17:39 UTC 2024 running job 16826 is running
Aug 22 15:23:13 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-16826.out
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-aarch64-generic-1724337267.tar.gz
size: 129 MiB (135615311 bytes)
entries: 4670
modules under 2023.06/software/linux/aarch64/generic/modules/all
gperftools/2.12-GCCcore-12.3.0.lua
imageio/2.33.1-gfbf-2023a.lua
NLTK/3.8.1-foss-2023a.lua
parameterized/0.9.0-GCCcore-12.3.0.lua
Scalene/1.5.26-GCCcore-12.3.0.lua
scikit-image/0.22.0-foss-2023a.lua
tensorboard/2.15.1-gfbf-2023a.lua
tqdm/4.66.1-GCCcore-12.3.0.lua
software under 2023.06/software/linux/aarch64/generic/software
gperftools/2.12-GCCcore-12.3.0
imageio/2.33.1-gfbf-2023a
NLTK/3.8.1-foss-2023a
parameterized/0.9.0-GCCcore-12.3.0
Scalene/1.5.26-GCCcore-12.3.0
scikit-image/0.22.0-foss-2023a
tensorboard/2.15.1-gfbf-2023a
tqdm/4.66.1-GCCcore-12.3.0
other under 2023.06/software/linux/aarch64/generic
2023.06/init/easybuild/eb_hooks.py
Aug 22 15:23:13 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 18/18 test case(s) from 18 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-16826.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Collaborator Author

casparvl commented Aug 22, 2024

Failure of the test suite on x86_64 with:

FAILURE INFO for EESSI_PyTorch_torchvision_CPU %nn_model=resnet50 %scale=1_node %parallel_strategy=None %module_name=PyTorch-bundle/2.1.2-foss-2023a (run: 1/1)
  * Description: Benchmark that runs a selected torchvision model on synthetic data
  * System partition: BotBuildTests:default
  * Environment: default
  * Stage directory: /project/60006/SHARED/jobs/2024.08/pr_585/event_33c66470-5ff9-11ef-924c-fc9f4cfa4137/run_000/linux_x86_64_amd_zen3/eessi.io-2023.06-software/reframe_runs/stage/BotBuildTests/default/default/EESSI_PyTorch_torchvision_CPU_39d248a6
  * Node list:
  * Job type: local (id=None)
  * Dependencies (conceptual): []
  * Dependencies (actual): []
  * Maintainers: []
  * Failing phase: setup
  * Rerun with '-n /39d248a6 -p default --system BotBuildTests:default -r'
  * Reason: attribute error: EESSI-test-suite/eessi/testsuite/utils.py:163: Processor information (num_cores_per_numa_node) missing. Check that processor information is either autodetected (see https://reframe-hpc.readthedocs.io/en/stable/configure.html#proc-autodetection), or manually set in the ReFrame configuration file (see https://reframe-hpc.readthedocs.io/en/stable/config_reference.html#processor-info).
    raise AttributeError(msg)

Ok, we didn't define that in our template config file. Also, it is particular to newer versions of ReFrame. I'll create a PR that adds a new version of ReFrame and I'll create a PR that no longer uses hard-coded processor features, but autodetects them. The challenge is that with the local spawner, if we use a single config file, it doesn't have the specific partition we submitted to. But, we can get that from the job environment and inject it in the config. I'll do that in #682 and a new ReFrame in #708

@trz42
Collaborator

trz42 commented Aug 27, 2024

Copying some findings from Slack here:

To me it seems the problem is a combination of what EasyBuild uses to run commands (it uses /bin/bash) and that we currently set LD_PRELOAD too early via the modified module file. Below are a few examples illustrating what happens.

The original TLS (Thread-Local Storage) allocation error... (without LD_PRELOAD, just running the import after loading Python, gperftools and setting PATH and PYTHONPATH to the build directory for SentencePiece)

bot@aarch64-generic-node3 /tmp/bot $ python -c 'import sentencepiece'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/SentencePiece/0.2.0-GCC-12.3.0/lib/python3.11/site-packages/sentencepiece/__init__.py", line 10, in <module>
    from . import _sentencepiece
ImportError: /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so.4: cannot allocate memory in static TLS block

With LD_PRELOAD this succeeds (same env otherwise)...

bot@aarch64-generic-node3 /tmp/bot $ LD_PRELOAD=/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so python -c 'import sentencepiece'

However, that is not how EasyBuild runs the sanity check command. Rather, it runs the following (which fails)...

bot@aarch64-generic-node3 /tmp/bot $ LD_PRELOAD=/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so /bin/bash -c "python -c 'import sentencepiece'"
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.36' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.35' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libgcc_s.so.1)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libgcc_s.so.1)

The above error is what we got in the last build job for aarch64/generic. If we run the original command in a subshell (as EasyBuild does), we get the original error (just to illustrate that we "correctly" emulate what EasyBuild does)...

bot@aarch64-generic-node3 /tmp/bot $ /bin/bash -c "python -c 'import sentencepiece'"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/SentencePiece/0.2.0-GCC-12.3.0/lib/python3.11/site-packages/sentencepiece/__init__.py", line 10, in <module>
    from . import _sentencepiece
ImportError: /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so.4: cannot allocate memory in static TLS block

If we set LD_PRELOAD just before we run python, it works...

bot@aarch64-generic-node3 /tmp/bot $ /bin/bash -c "LD_PRELOAD=/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so python -c 'import sentencepiece'"

I think setting LD_PRELOAD in the module for SentencePiece could work. However, when running EasyBuild we'll likely run into issues because it uses /bin/bash to run commands. If it used bash from the compat layer, it would work. See the example below:

bot@aarch64-generic-node3 /tmp/bot $ LD_PRELOAD=/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so.4 /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/bin/bash -c "python -c 'import sentencepiece'"

To me it seems that /bin/bash and /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so.4 depend on different symbols (which sounds logical), hence it is critical to only preload the latter library after /bin/bash's dependencies have been resolved.
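To see at a glance which preloaded libraries ask for glibc symbol versions that the host libc behind /bin/bash does not provide, the loader messages shown earlier in this comment can be grouped per library. This is a small sketch; the helper name is made up for illustration, and it only parses text in the format of the messages above:

```python
import re
from collections import defaultdict

# Matches e.g.: version `GLIBC_2.34' not found (required by /path/to/lib.so)
_PATTERN = re.compile(r"version `(GLIBC_[0-9.]+)' not found \(required by (.+?)\)")


def _version_key(version):
    """Turn 'GLIBC_2.34' into (2, 34) so versions sort numerically."""
    return tuple(int(part) for part in version.split("_", 1)[1].split("."))


def missing_glibc_versions(log_text):
    """Map each requiring library to the sorted GLIBC versions that were not found."""
    needed = defaultdict(set)
    for version, library in _PATTERN.findall(log_text):
        needed[library].add(version)
    return {lib: sorted(versions, key=_version_key) for lib, versions in needed.items()}
```

Feeding it the captured stderr of the failing command gives one entry per library, which makes it easier to compare against the glibc version of the host.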

@boegel
Contributor

boegel commented Aug 27, 2024

@trz42 Doesn't this mean that EasyBuild should be using the /bin/bash from the compat layer, so prefixed with sysroot in EasyBuild lingo?

@trz42
Collaborator

trz42 commented Aug 27, 2024

@trz42 Doesn't this mean that EasyBuild should be using the /bin/bash from the compat layer, so prefixed with sysroot in EasyBuild lingo?

Maybe. If sysroot implies that it can expect a sysroot/bin/bash it could work. However, it has only resulted in a problem when we use LD_PRELOAD. So, maybe we should look for another solution.

I'm trying to solve the issue with a parse hook where I just add LD_PRELOAD=... in front of the failing sanity check command and another hook to add LD_PRELOAD=... in the module file. However, the latter has to be done after the sanity check has been run.
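The command-rewriting part of such a hook could look roughly like the sketch below. The helper name is hypothetical, and whether the failing import check is actually exposed through the easyconfig's list of sanity check commands is an assumption; only plain string commands are handled:

```python
def prefix_sanity_check_commands(commands, library):
    """Return a new list of commands with LD_PRELOAD=<library> prepended.

    Commands that already start with an LD_PRELOAD assignment, and entries
    that are not plain strings, are passed through unchanged.
    """
    assignment = f"LD_PRELOAD={library}"
    prefixed = []
    for cmd in commands:
        if isinstance(cmd, str) and not cmd.startswith("LD_PRELOAD="):
            cmd = f"{assignment} {cmd}"
        prefixed.append(cmd)
    return prefixed
```

Because the assignment sits inside the command string, it only takes effect once the shell that runs the command has already started, which matches the working variant shown in the transcripts above.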

A better fix could be what you suggest, in some cases or always, we prefix the exec_cmd = "/bin/bash" (/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/EasyBuild/4.9.2/lib/python3.11/site-packages/easybuild/tools/run.py:229) with sysroot when it is present. Then we could just add the setting of LD_PRELOAD=... in the module file and it should work both while using the module and while running the sanity check.
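As an illustration of the idea (not the actual change to EasyBuild's run.py), the shell path could be resolved as follows. The function name is made up, and falling back to the unprefixed shell when the sysroot has no usable bash is an assumption about the desired behaviour:

```python
import os


def resolve_shell(sysroot=None, shell="/bin/bash"):
    """Return <sysroot>/bin/bash if it exists and is executable, else the plain shell path."""
    if sysroot:
        candidate = os.path.join(sysroot, shell.lstrip(os.sep))
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return shell
```

With the sysroot set to the compat layer prefix, this would pick the compat layer's bash, so a preloaded library and the shell resolve against the same glibc.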

@casparvl
Collaborator Author

@trz42 Doesn't this mean that EasyBuild should be using the /bin/bash from the compat layer, so prefixed with sysroot in EasyBuild lingo?

To me, this makes a lot of sense actually. If you're explicitly invoking a shell to run your command, and if a sysroot is set, it should be the shell from that sysroot prefix imho.

What is the reason that EasyBuild is running this in a subshell, actually? I mean, that is not typically how I would test the module manually, and it could potentially lead to differences with running it in the parent shell (this example being a case in point).

@boegel
Contributor

boegel commented Aug 28, 2024

@casparvl All shell commands run by EasyBuild are run in a subshell...

@boegel
Contributor

boegel commented Aug 28, 2024

A better fix could be what you suggest, in some cases or always, we prefix the exec_cmd = "/bin/bash" (/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/EasyBuild/4.9.2/lib/python3.11/site-packages/easybuild/tools/run.py:229) with sysroot when it is present. Then we could just add the setting of LD_PRELOAD=... in the module file and it should work both while using the module and while running the sanity check.

I think that's the right way forward...

It's a relatively easy change to make in EasyBuild (though in some sense a breaking one, so perhaps we need to make it configurable).

@trz42
Collaborator

trz42 commented Aug 28, 2024

We may even test this change already by copying the bash files from the two compat layers (x86_64 and aarch64) to some directory in the PR and then modify the launch of the containers such that the right file is bind mounted to /bin/bash inside the container. Before we run eessi_container.sh we can set SINGULARITY_BIND.
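A sketch of how the bind specification could be assembled before calling eessi_container.sh. It assumes the copied bash binaries live in a per-architecture layout (`<dir>/<arch>/bash`, hypothetical) and relies on `SINGULARITY_BIND` being a comma-separated list of `source:destination` entries; both function names are made up:

```python
import os
import platform


def bash_bind_spec(bash_dir, machine=None):
    """Build 'source:/bin/bash' for the bash copy matching the CPU architecture."""
    machine = machine or platform.machine()
    if machine not in ("x86_64", "aarch64"):
        raise ValueError(f"no compat-layer bash copy for architecture {machine!r}")
    source = os.path.join(bash_dir, machine, "bash")  # hypothetical layout
    return f"{source}:/bin/bash"


def extend_singularity_bind(env, spec):
    """Return a copy of env with spec appended to SINGULARITY_BIND."""
    new_env = dict(env)
    current = env.get("SINGULARITY_BIND")
    new_env["SINGULARITY_BIND"] = f"{current},{spec}" if current else spec
    return new_env
```

The returned environment can then be passed to whatever launches the container, so existing bind mounts are preserved.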

@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3


eessi-bot bot commented Sep 17, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 resulted in:

Updates by the bot instance boegel-bot-deucalion (click for details)
  • account casparvl has NO permission to send commands to the bot


eessi-bot bot commented Sep 17, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 resulted in:

    • no jobs were submitted


eessi-bot bot commented Sep 17, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen3 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.09/pr_585/18888

date job status comment
Sep 17 20:57:04 UTC 2024 submitted job id 18888 awaits release by job manager
Sep 17 20:57:37 UTC 2024 released job awaits launch by Slurm scheduler
Sep 17 21:04:40 UTC 2024 running job 18888 is running
Sep 17 22:23:56 UTC 2024 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-18888.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen3-1726611642.tar.gz
size: 154 MiB (162214378 bytes)
entries: 6200
modules under 2023.06/software/linux/x86_64/amd/zen3/modules/all
gperftools/2.12-GCCcore-12.3.0.lua
imageio/2.33.1-gfbf-2023a.lua
libmad/0.15.1b-GCCcore-12.3.0.lua
NLTK/3.8.1-foss-2023a.lua
parameterized/0.9.0-GCCcore-12.3.0.lua
PyTorch-bundle/2.1.2-foss-2023a.lua
Scalene/1.5.26-GCCcore-12.3.0.lua
scikit-image/0.22.0-foss-2023a.lua
SentencePiece/0.2.0-GCC-12.3.0.lua
SoX/14.4.2-GCCcore-12.3.0.lua
tensorboard/2.15.1-gfbf-2023a.lua
software under 2023.06/software/linux/x86_64/amd/zen3/software
gperftools/2.12-GCCcore-12.3.0
imageio/2.33.1-gfbf-2023a
libmad/0.15.1b-GCCcore-12.3.0
NLTK/3.8.1-foss-2023a
parameterized/0.9.0-GCCcore-12.3.0
PyTorch-bundle/2.1.2-foss-2023a
Scalene/1.5.26-GCCcore-12.3.0
scikit-image/0.22.0-foss-2023a
SentencePiece/0.2.0-GCC-12.3.0
SoX/14.4.2-GCCcore-12.3.0
tensorboard/2.15.1-gfbf-2023a
other under 2023.06/software/linux/x86_64/amd/zen3
2023.06/init/easybuild/eb_hooks.py
Sep 17 22:23:56 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 9/9 test case(s) from 9 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-18888.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Collaborator Author

casparvl commented Sep 18, 2024

Ok, good that the test now works on x86_64.

For this issue on ARM, I made a fix for the EasyBuild framework (easybuilders/easybuild-framework#4646), only to realize afterwards that the whole run_cmd machinery is completely overhauled in EasyBuild 5.0. Looking at the 5.0.X code here, I see:

    # use bash as shell instead of the default /bin/sh used by subprocess.run
    # (which could be dash instead of bash, like on Ubuntu, see https://wiki.ubuntu.com/DashAsBinSh)
    # stick to None (default value) when not running command via a shell
    if use_bash:
        bash = shutil.which('bash')
        _log.info(f"Path to bash that will be used to run shell commands: {bash}")
        executable, shell = bash, True
    else:
        executable, shell = None, False

I tested a build of SentencePiece, including the LD_PRELOAD hook:

eb --hooks $HOME/EESSI/software-layer/eb_hooks.py SentencePiece-0.2.0-GCC-12.3.0.eb --rebuild

with EasyBuild 5.0.X (from the current branch), and that worked without encountering the previous issue.

In other words, there is not much to fix, we just need to wait for EasyBuild 5.X to be released (soon, I hope :D). Or we need to reinstall 4.9.3 with a patch based on easybuilders/easybuild-framework#4646 so we can proceed here.
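`shutil.which` walks the directories of the search path in order and returns the first executable match. That the compat layer's `bin` directory comes before `/bin` in PATH during the build is an assumption here, not something the quoted snippet states. A small sketch with a made-up helper name shows the effect of that ordering:

```python
import os
import shutil


def which_bash_with_prefix(prefix_bin, base_path=None):
    """Look up 'bash' with prefix_bin searched before the rest of the path."""
    if base_path is None:
        base_path = os.environ.get("PATH", os.defpath)
    search_path = os.pathsep.join([prefix_bin, base_path])
    return shutil.which("bash", path=search_path)
```

If `prefix_bin` contains an executable named `bash`, that one is returned even when `/bin/bash` also exists later in the path.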

@casparvl
Collaborator Author

casparvl commented Sep 18, 2024

Hmm, while the issue for SentencePiece is solved (it now installs successfully), I'm getting

  -- Check for working C compiler: /tmp/eb-cw54zzvr/tmprgti6_vm/rpath_wrappers/gcc_wrapper/gcc - broken
  CMake Error at /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/CMake/3.26.3-GCCcore-12.3.0/share/cmake-3.26/Modules/CMakeTestCCompiler.cmake:67 (message):
    The C compiler

      "/tmp/eb-cw54zzvr/tmprgti6_vm/rpath_wrappers/gcc_wrapper/gcc"

    is not able to compile a simple test program.

    It fails with the following output:

      Change Dir: /tmp/casparvl/easybuild/build/PyTorchbundle/2.1.2/foss-2023a/torchtext/text-0.16.2/build/temp.linux-aarch64-cpython-311/CMakeFiles/CMakeScratch/TryCompile-XrjNFV

      Run Build Command(s):/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Ninja/1.11.1-GCCcore-12.3.0/bin/ninja -v cmTC_64b77 && [1/2] /tmp/eb-cw54zzvr/tmprgti6_vm/rpath_wrappers/gcc_wrapper/gcc   -O2 -ftree-vectorize -mcpu=native -fno-math-errno -o CMakeFiles/cmTC_64b77.dir/testCCompiler.c.o -c /tmp/casparvl/easybuild/build/PyTorchbundle/2.1.2/foss-2023a/torchtext/text-0.16.2/build/temp.linux-aarch64-cpython-311/CMakeFiles/CMakeScratch/TryCompile-XrjNFV/testCCompiler.c
      FAILED: CMakeFiles/cmTC_64b77.dir/testCCompiler.c.o
      /tmp/eb-cw54zzvr/tmprgti6_vm/rpath_wrappers/gcc_wrapper/gcc   -O2 -ftree-vectorize -mcpu=native -fno-math-errno -o CMakeFiles/cmTC_64b77.dir/testCCompiler.c.o -c /tmp/casparvl/easybuild/build/PyTorchbundle/2.1.2/foss-2023a/torchtext/text-0.16.2/build/temp.linux-aarch64-cpython-311/CMakeFiles/CMakeScratch/TryCompile-XrjNFV/testCCompiler.c
      /bin/sh: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /home/casparvl/eessi/versions/2023.06/software/linux/aarch64/neoverse_n1/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so)
      /bin/sh: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
      /bin/sh: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
      /bin/sh: /lib64/libc.so.6: version `GLIBC_2.36' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
      /bin/sh: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
      /bin/sh: /lib64/libc.so.6: version `GLIBC_2.35' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/GCCcore/12.3.0/lib64/libgcc_s.so.1)
      /bin/sh: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/GCCcore/12.3.0/lib64/libgcc_s.so.1)
      ninja: build stopped: subcommand failed.

when it is installing torchtext from PyTorch-Bundle. I think the /bin/sh here comes from the fact that some python process invokes subprocess.run(), which uses /bin/sh according to https://github.com/easybuilders/easybuild-framework/blob/a2550eb8fab479f517badbf45925c3cebda2880c/easybuild/tools/run.py#L450

The last part of the stack trace I'm getting:

    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
      dist.run_commands()
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
      self.run_command(cmd)
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/dist.py", line 1244, in run_command
      super().run_command(command)
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/wheel/bdist_wheel.py", line 343, in run
      self.run_command("build")
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
      self.distribution.run_command(command)
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/dist.py", line 1244, in run_command
      super().run_command(command)
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/_distutils/command/build.py", line 131, in run
      self.run_command(cmd_name)
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
      self.distribution.run_command(command)
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/dist.py", line 1244, in run_command
      super().run_command(command)
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/tmp/casparvl/easybuild/build/PyTorchbundle/2.1.2/foss-2023a/torchtext/text-0.16.2/tools/setup_helpers/extension.py", line 46, in run
      super().run()
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/command/build_ext.py", line 84, in run
      _build_ext.run(self)
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
      _build_ext.build_ext.run(self)
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
      self.build_extensions()
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
      _build_ext.build_ext.build_extensions(self)
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 467, in build_extensions
      self._build_extensions_serial()
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 493, in _build_extensions_serial
      self.build_extension(ext)
    File "/tmp/casparvl/easybuild/build/PyTorchbundle/2.1.2/foss-2023a/torchtext/text-0.16.2/tools/setup_helpers/extension.py", line 108, in build_extension
      subprocess.check_call(["cmake", str(_ROOT_DIR)] + cmake_args, cwd=self.build_temp)
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/subprocess.py", line 413, in check_call
      raise CalledProcessError(retcode, cmd)

That's annoying to say the least. We can fix it, but it might require a patch to Python to alter which sh command is used by default by subprocess.run. Alternatively, we change the subprocess call done by /tmp/casparvl/easybuild/build/PyTorchbundle/2.1.2/foss-2023a/torchtext/text-0.16.2/tools/setup_helpers/extension.py. That's much smaller impact, but also a less complete fix. It means that any other software using SentencePiece and calling subprocess.run will still run into this issue.

@boegel
Contributor

boegel commented Sep 18, 2024

In other words, there is not much to fix, we just need to wait for EasyBuild 5.X to be released (soon, I hope :D). Or we need to reinstall 4.9.3 with a patch based on easybuilders/easybuild-framework#4646 so we can proceed here.

@casparvl There's an EasyBuild v4.9.4 release coming really soon (in next couple of days), because the GCC easyblock in EasyBuild v4.9.3 has a serious bug that many people will easily run into (see here), so it's worth trying to get easybuilders/easybuild-framework#4646 merged ASAP.

@boegel
Contributor

boegel commented Sep 18, 2024

That's annoying to say the least. We can fix it, but it might require a patch to Python to alter which sh command is used by default by subprocess.run. Alternatively, we change the subprocess call done by /tmp/casparvl/easybuild/build/PyTorchbundle/2.1.2/foss-2023a/torchtext/text-0.16.2/tools/setup_helpers/extension.py. That's much smaller impact, but also a less complete fix. It means that any other software using SentencePiece and calling subprocess.run will still run into this issue.

@casparvl A patch to Python seems like the best way forward here.
We should check what Gentoo Prefix does here, since they must have run into similar issues with a hardcoded /bin/sh?

@casparvl
Collaborator Author

From the sources, it seems to be equally broken in Gentoo Prefix:

$ cat /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/usr/lib/python3.11/subprocess.py | grep -A 5 "/bin/sh"
    >>> check_output(["/bin/sh", "-c",
    ...               "ls -l non_existent_file ; exit 0"],
    ...              stderr=STDOUT)
    b'ls: non_existent_file: No such file or directory\n'

    There is an additional optional argument, "input", allowing you to
--
                # On Android the default shell is at '/system/bin/sh'.
                unix_shell = ('/system/bin/sh' if
                          hasattr(sys, 'getandroidapilevel') else '/bin/sh')
                args = [unix_shell, "-c"] + args
                if executable:
                    args[0] = executable

            if executable is None:

@casparvl
Collaborator Author

I confirmed that if I run a subprocess.run("sleep 5", shell=True) with the python from the compat layer, it will use /bin/sh to execute this command. So yes, it's just as broken in the Python in Gentoo-Prefix.

The fix should be very simple: prepend the sysroot to the path on this line in the source code. I guess this could (and should) be done at the EasyBlock level. I'll look at that later...
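Until such a patch exists, code that controls its own subprocess call can already choose the shell explicitly: the excerpt from subprocess.py above shows that a non-empty `executable` replaces `/bin/sh` as `args[0]`. A minimal sketch (the wrapper name is made up; which shell path to pass, for example one under the sysroot, is left to the caller):

```python
import subprocess


def run_in_shell(command, shell_path=None):
    """Run a command string through a shell.

    shell_path=None keeps subprocess's default shell (/bin/sh on Linux);
    any other value is used as the shell binary instead.
    """
    return subprocess.run(
        command,
        shell=True,
        executable=shell_path,
        capture_output=True,
        text=True,
        check=False,
    )
```

This only helps for calls that can be edited; third-party code calling `subprocess.run(..., shell=True)` without `executable` still gets the default shell.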
