{2023.06}[2023a] PyTorch-Bundle v2.1.2 #585
base: 2023.06-software.eessi.io
casparvl commented May 23, 2024 (edited by trz42)
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 |
I guess that with |
I (and @trz42 and @ocaisa ) also saw issues with using |
Could you try using the merge commit (see bottom of the PR: 04ccd901a613631b00ccbe504d6d66d6a6c2febb) and check if that does work? |
I tried manually
But that still shows missing EasyConfigs. |
Guess we need to stick to |
I was being stupid. I made a mistake in what I ran manually: that's with CUDA. That's not included in that PR/commit for sure... :P However,
shows the same missing easyconfigs. I've switched to |
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 |
This is the actual failure:
I guess this should be provided by the module I guess this is a pretty fundamental question: how do we make |
See #192 , the Alliance have a solution for this |
Spot on, it is indeed the issue of The downside is that the Alliance's solution looks quite involved... The upside is we can probably use their shadowing lib from https://github.com/ComputeCanada/custom_ctypes/tree/main/lib . What I don't fully understand is the I guess my main consideration would be if we shouldn't just always have this patched |
I was also thinking that maybe a patch on ctypes is enough, I don't fully understand all the other stuff going on with them |
The changes they apply to
diff -u /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/ctypes/util.py custom_ctypes/lib/python3.11/site-packages/ctypes/util.py
--- /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/ctypes/util.py 2024-04-30 16:38:09.000000000 +0200
+++ custom_ctypes/lib/python3.11/site-packages/ctypes/util.py 2024-05-30 16:17:44.000000000 +0200
@@ -326,7 +326,10 @@
         def find_library(name):
             # See issue #9998
+            lib = _findLib_gcc(name)
+            # return absolute path
             return _findSoname_ldconfig(name) or \
+                   os.path.join(os.path.dirname(lib), _get_soname(lib)) or \
                    _get_soname(_findLib_gcc(name)) or _get_soname(_findLib_ld(name))
 ################################################################ |
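To make the added line in the diff above easier to reason about, here is a minimal standalone sketch of the "return absolute path" step. The helper name `absolute_soname_path` and the example paths are hypothetical and are not part of `ctypes` or of the Alliance's `custom_ctypes`. The sketch also guards against a missing library, which is an assumption on my part: `os.path.dirname(None)` raises `TypeError`, so an unguarded call would presumably fail whenever the gcc-based lookup finds nothing.

```python
import os
from typing import Optional


def absolute_soname_path(lib_path: Optional[str], soname: Optional[str]) -> Optional[str]:
    """Join the directory of a located library with its soname.

    Hypothetical helper mirroring
    os.path.join(os.path.dirname(lib), _get_soname(lib)) from the diff,
    but returning None when either piece is missing so that a caller
    can fall through to the next lookup in an `or` chain.
    """
    if not lib_path or not soname:
        return None
    return os.path.join(os.path.dirname(lib_path), soname)


# Example with made-up paths: the bare soname becomes an absolute path.
print(absolute_soname_path("/opt/sw/lib/libfoo.so", "libfoo.so.1"))
```

Returning `None` instead of raising keeps the behaviour compatible with the existing `or` chain in `find_library`, where each lookup is expected to yield either a string or a falsy value.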
I tried to replace the
Will try to use that modified file only when building/using |
I've worked out a fix for the If it works out there, I'll test it with PyTorch-bundle. We can discuss how we should employ this fix (maybe it's better to ship the custom |
I updated NorESSI#387 with the fixes in NorESSI#391 to work around the failing sanity check ( |
@trz42 I remember you said in a meeting that simply patching I was thinking: what if we patch Now, this would be super annoying if there are packages that do a lot of |
Failure of the test suite on x86_64 with:
Ok, we didn't define that in our template config file. Also, it is particular to newer versions of ReFrame. I'll create a PR that adds a new version of ReFrame and I'll create a PR that no longer uses hard-coded processor features, but autodetects them. The challenge is that with the local spawner, if we use a single config file, it doesn't have the specific partition we submitted to. But, we can get that from the job environment and inject it in the config. I'll do that in #682 and a new ReFrame in #708 |
Copying some findings from Slack here:
To me it seems the problem is a combination of what EasyBuild uses to run commands (it uses
The original TLS (Thread-Local Storage) allocation error... (withou
bot@aarch64-generic-node3 /tmp/bot $ python -c 'import sentencepiece'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/SentencePiece/0.2.0-GCC-12.3.0/lib/python3.11/site-packages/sentencepiece/__init__.py", line 10, in <module>
from . import _sentencepiece
ImportError: /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so.4: cannot allocate memory in static TLS block
With
bot@aarch64-generic-node3 /tmp/bot $ LD_PRELOAD=/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so python -c 'import sentencepiece'
However, that is not how EasyBuild runs the sanity check command. Instead, it runs the following (which fails)...
bot@aarch64-generic-node3 /tmp/bot $ LD_PRELOAD=/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so /bin/bash -c "python -c 'import sentencepiece'"
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.36' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.35' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libgcc_s.so.1)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libgcc_s.so.1)
The above error is what we got in the last build job for
bot@aarch64-generic-node3 /tmp/bot $ /bin/bash -c "python -c 'import sentencepiece'"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/SentencePiece/0.2.0-GCC-12.3.0/lib/python3.11/site-packages/sentencepiece/__init__.py", line 10, in <module>
from . import _sentencepiece
ImportError: /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so.4: cannot allocate memory in static TLS block
If we set
bot@aarch64-generic-node3 /tmp/bot $ /bin/bash -c "LD_PRELOAD=/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so python -c 'import sentencepiece'"
I think, setting
bot@aarch64-generic-node3 /tmp/bot $ LD_PRELOAD=/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so.4 /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/bin/bash -c "python -c 'import sentencepiece'"
To me it seems that |
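The difference between the failing and working invocations above comes down to where `LD_PRELOAD` is set. If it is in the environment of the process that launches `/bin/bash`, the host shell itself tries to preload the library. If it is written inside the command string, only the command started by that shell sees it. A minimal sketch of the second approach follows; the helper names `with_inline_env` and `shell_argv` and the shortened library path are hypothetical, and this is not how EasyBuild builds its commands.

```python
import shlex
from typing import Dict, List


def with_inline_env(cmd: str, env: Dict[str, str]) -> str:
    """Prefix a shell command string with VAR=value assignments.

    The assignments then apply only to the command started by the
    shell, not to the shell process that interprets the string.
    """
    assignments = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    return f"{assignments} {cmd}" if assignments else cmd


def shell_argv(cmd: str, env: Dict[str, str], shell: str = "/bin/bash") -> List[str]:
    """Build an argv list that runs `cmd` via `shell -c` with inline env."""
    return [shell, "-c", with_inline_env(cmd, env)]


# Hypothetical, shortened library path for illustration only.
preload = {"LD_PRELOAD": "/opt/lib/libtcmalloc_minimal.so"}
print(shell_argv("python -c 'import sentencepiece'", preload))
```

The resulting list could be passed to `subprocess.run` without adding `LD_PRELOAD` to the `env` argument, so the host `/bin/bash` starts with its own libraries and only the Python process picks up the preloaded one.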
@trz42 Doesn't this mean that EasyBuild should be using the |
Maybe. If I'm trying to solve the issue with a parse hook where I just add A better fix could be what you suggest, in some cases or always, we prefix the |
To me, this makes a lot of sense actually. If you're explicitly invoking a shell to run your command, and if a What is the reason that EasyBuild is running this in a subshell actually? I mean that is not typically how I would test the module manually and could potentially lead to differences with running it in the parent shell (this example being a case in point). |
@casparvl All shell commands run by EasyBuild are run in a subshell... |
I think that's the right way forward... It's a relatively easy change to make in EasyBuild (though in some sense a breaking one, so perhaps we need to make it configurable). |
We may even test this change already by copying the |
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 |
Ok, good that the test now works on For this issue on ARM, I made a fix easybuilders/easybuild-framework#4646 for EasyBuild framework, only to realize afterwards that the whole
I tested a build of SentencePiece, including the
with EasyBuild 5.0.X (from the current branch), and that worked without encountering the previous issue. In other words, there is not much to fix, we just need to wait for EasyBuild 5.X to be released (soon, I hope :D). Or we need to reinstall 4.9.3 with a patch based on easybuilders/easybuild-framework#4646 so we can proceed here. |
Hmm, while the issue for SentencePiece is solved (this now installs successfully), I'm getting
when it is installing The last part of the stack trace I'm getting:
That's annoying to say the least. We can fix it, but it might require a patch to Python to alter which |
@casparvl There's an EasyBuild v4.9.4 release coming really soon (in next couple of days), because the GCC easyblock in EasyBuild v4.9.3 has a serious bug that many people will easily run into (see here), so it's worth trying to get easybuilders/easybuild-framework#4646 merged ASAP. |
@casparvl A patch to Python seems like the best way forward here. |
From the sources, it seems to be equally broken in Gentoo Prefix:
|
I confirmed that if I run a The fix should be very simple: prepend the sysroot to the path on this line in the source code. I guess this could (and should) be done at the EasyBlock level. I'll look at that later... |
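As a rough illustration of the "prepend the sysroot to the path" idea, the sketch below shows one way to do the join safely. The function name `prepend_sysroot` and the example path `/sbin/ldconfig` are hypothetical; the thread does not say which path on the referenced source line is affected, so this shows the general technique rather than the actual patch.

```python
import os


def prepend_sysroot(path: str, sysroot: str = "") -> str:
    """Return `path` relocated under `sysroot`.

    os.path.join discards everything before an absolute second
    argument, so the leading separator is stripped before joining.
    An empty sysroot leaves the path unchanged.
    """
    if not sysroot:
        return path
    return os.path.join(sysroot, path.lstrip(os.sep))


# The compat layer prefix appears earlier in this thread; the tool path is illustrative.
sysroot = "/cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64"
print(prepend_sysroot("/sbin/ldconfig", sysroot))
```

Stripping the leading separator matters because `os.path.join(sysroot, "/sbin/ldconfig")` would otherwise return `/sbin/ldconfig` unchanged.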