TensorFlow builds fail on JURECA and JUWELS #13967

Open
SebastianAchilles opened this issue Sep 10, 2021 · 8 comments
@SebastianAchilles
Member

Follow-up on the failing test reports in #13877.

Building TensorFlow 2.4.1, 2.5.0 and 2.6.0 fails on JURECA and JUWELS, but not in my minimal Rocky Linux 8.4 container.

EB Config and system info

Test Reports:

I am trying to understand why the builds fail on JURECA and JUWELS, and whether this is caused by the specific configuration of these systems. Maybe @Flamefire or @boegel can give me a hint to better understand the problem?

@boegel
Member

boegel commented Sep 10, 2021

@SebastianAchilles Anything useful in the logs for the failing csv_dataset_test?

@SebastianAchilles
Member Author

SebastianAchilles commented Sep 10, 2021

2021-09-09 20:56:46.792936: F tensorflow/core/platform/default/env.cc:73] Check failed: ret == 0 (11 vs. 0)Thread tf_data_private_threadpool creation via pthread_create() failed.
Fatal Python error: Aborted

This looks like a problem with thread creation. I assume it is caused by our (too?) strict limits on the login nodes.
On the backend nodes I cannot build TensorFlow at the moment because UnZip is required for building but is not in our lightweight OS image. I have already opened PRs for this:

@SebastianAchilles
Member Author

It seems like

$ ulimit -u
1024

caused the problem on the login nodes. A larger value would be needed to run the tests. @boegel you use 10k, right?
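
For reference, the error code 11 in the log above is EAGAIN on Linux, which pthread_create() returns when a resource limit such as the per-user process/thread limit (ulimit -u, RLIMIT_NPROC) is reached. The following is a minimal Python sketch for checking that limit before starting the tests. The threshold of 10000 is only the figure asked about in this thread, not a documented TensorFlow requirement, and the helper name nproc_limit_ok is hypothetical.

```python
import resource


def nproc_limit_ok(soft_limit, required):
    """Return True if the soft per-user process/thread limit is at least `required`."""
    # RLIM_INFINITY means "no limit", which always satisfies the requirement.
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= required


if __name__ == "__main__":
    # Assumption: 10000 is the value mentioned in this thread, not an official minimum.
    REQUIRED = 10000
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    if nproc_limit_ok(soft, REQUIRED):
        print("ulimit -u looks sufficient:", soft)
    else:
        print("ulimit -u is", soft, "which is below", REQUIRED)
```

On a node where ulimit -u reports 1024, this would report the limit as too low for the assumed threshold.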

With UnZip added, I get a bit further on the backend node: Now I see

external/com_github_grpc_grpc/src/core/tsi/alts/crypt/aes_gcm.cc:25:10: fatal error: openssl/bio.h: No such file or directory
   25 | #include <openssl/bio.h>
      |          ^~~~~~~~~~~~~~~
compilation terminated.

cf. #13960 (comment)

@boegel Do you have openssl-devel installed on your system? On the lightweight OS of the backend nodes we only have:

$ yum list installed | grep openssl
compat-openssl10.x86_64                                  1:1.0.2o-3.el8                             @AppStream   
openssl.x86_64                                           1:1.1.1g-15.el8_3                          @BaseOS      
openssl-libs.x86_64                                      1:1.1.1g-15.el8_3                          @BaseOS      
openssl-pkcs11.x86_64                                    0.4.10-2.el8                               @BaseOS

OpenSSL-1.1.eb used the fallback option and installed OpenSSL-1.1.1k.eb with EasyBuild. But it seems like openssl-devel is still required as OS dependency.
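
A quick way to see whether the OpenSSL development headers are visible at all is to look for openssl/bio.h in the include directories. The sketch below assumes the directories listed in the CPATH environment variable plus /usr/include; both helper names are hypothetical, and finding the header this way does not guarantee that Bazel searches the same directories during the TensorFlow build.

```python
import os


def default_include_dirs(environ=None):
    """Return candidate include directories: entries of CPATH, then /usr/include."""
    environ = os.environ if environ is None else environ
    dirs = [d for d in environ.get("CPATH", "").split(os.pathsep) if d]
    dirs.append("/usr/include")
    return dirs


def find_header(header, include_dirs):
    """Return the full path of the first matching header file, or None if absent."""
    for directory in include_dirs:
        candidate = os.path.join(directory, header)
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    path = find_header("openssl/bio.h", default_include_dirs())
    print(path if path else "openssl/bio.h not found in the checked directories")
```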

@Flamefire
Contributor

> With UnZip added, I get a bit further on the backend node

What was the error when that was not added?

> But it seems like openssl-devel is still required as OS dependency.

Bazel sends its greetings... Will send an easyblock PR to fix this.

@branfosj
Member

> With UnZip added, I get a bit further on the backend node

Did it fail on Dill? We use 'source_tmpl': '%(name)s-%(version)s.zip' for that, so UnZip is needed to unpack it if unzip is not installed in the OS. With UnZip added, this works because the build dependencies are loaded before extensions are unpacked.

Also, does SciPy-bundle fail in the same way? There numpy is a SOURCE_ZIP.
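
To see in advance which sources would hit this, one can list the .zip sources and check whether an unzip executable is on the PATH. This is only a sketch: the helper names and the file names in the example are hypothetical placeholders, not the actual source names or versions used in the easyconfigs.

```python
import shutil


def sources_needing_unzip(source_filenames):
    """Return the source file names that end in .zip and so need unzip to unpack."""
    return [name for name in source_filenames if name.lower().endswith(".zip")]


def unzip_available():
    """Return True if an `unzip` executable is found on the current PATH."""
    return shutil.which("unzip") is not None


if __name__ == "__main__":
    # Placeholder file names for illustration only.
    sources = ["dill-X.Y.Z.zip", "numpy-X.Y.Z.zip", "other-X.Y.Z.tar.gz"]
    zipped = sources_needing_unzip(sources)
    if zipped and not unzip_available():
        print("unzip is missing but needed for:", ", ".join(zipped))
    else:
        print("no problem expected for zip sources")
```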

@SebastianAchilles
Member Author

> With UnZip added, I get a bit further on the backend node
>
> Did it fail on Dill?

Yes, it failed on Dill. On the lightweight OS of the backend nodes we do not have UnZip installed.

> Also, does SciPy-bundle fail in the same way? There numpy is a SOURCE_ZIP.

Yes, I got the same "unzip not found" error with SciPy-bundle on our backend nodes.

> But it seems like openssl-devel is still required as OS dependency.
>
> Bazel sends its greetings... Will send an easyblock PR to fix this.

Thank you!

@Flamefire
Contributor

@SebastianAchilles
Member Author

> @SebastianAchilles Can you test with easybuilders/easybuild-easyblocks#2575?

I will start a test report after 6 PM. Today is maintenance day on all systems.
