sanity check for TensorFlow-2.6.0-foss-2021a.eb fails (download problem when running TensorFlow-2.x_mnist-test.py test case) #14058

Closed
avapirev opened this issue Sep 23, 2021 · 20 comments

@avapirev

avapirev commented Sep 23, 2021

The error is past the sanity check during the install phase:

== 2021-09-22 11:19:49,518 extensioneasyblock.py:181 INFO Sanity check for TensorFlow successful!
== 2021-09-22 11:19:49,519 environment.py:91 INFO Environment variable PYTHONPATH set to /apps/leuven/skylake/2021a/software/TensorFlow/2.6.0-foss-2021a/lib/python3.9/site-packages:/apps/leuven/skylake/2021a/software/TensorFlow/2.6.0-foss-2021a/lib/python3.9/site-packages:/apps/leuven/skylake/2021a/software/typing-extensions/3.10.0.0-GCCcore-10.3.0/lib/python3.9/site-packages:/apps/leuven/skylake/2021a/software/flatbuffers-python/2.0-GCCcore-10.3.0/lib/python3.9/site-packages:/apps/leuven/skylake/2021a/software/protobuf-python/3.17.3-GCCcore-10.3.0/lib/python3.9/site-packages:/apps/leuven/skylake/2021a/software/h5py/3.2.1-foss-2021a/lib/python3.9/site-packages:/apps/leuven/skylake/2021a/software/SciPy-bundle/2021.05-foss-2021a/lib/python3.9/site-packages:/apps/leuven/skylake/2021a/software/pybind11/2.6.2-GCCcore-10.3.0/lib/python3.9/site-packages:/apps/leuven/skylake/2021a/software/Python/3.9.5-GCCcore-10.3.0/easybuild/python (previous value: '/apps/leuven/skylake/2021a/software/TensorFlow/2.6.0-foss-2021a/lib/python3.9/site-packages:/apps/leuven/skylake/2021a/software/typing-extensions/3.10.0.0-GCCcore-10.3.0/lib/python3.9/site-packages:/apps/leuven/skylake/2021a/software/flatbuffers-python/2.0-GCCcore-10.3.0/lib/python3.9/site-packages:/apps/leuven/skylake/2021a/software/protobuf-python/3.17.3-GCCcore-10.3.0/lib/python3.9/site-packages:/apps/leuven/skylake/2021a/software/h5py/3.2.1-foss-2021a/lib/python3.9/site-packages:/apps/leuven/skylake/2021a/software/SciPy-bundle/2021.05-foss-2021a/lib/python3.9/site-packages:/apps/leuven/skylake/2021a/software/pybind11/2.6.2-GCCcore-10.3.0/lib/python3.9/site-packages:/apps/leuven/skylake/2021a/software/Python/3.9.5-GCCcore-10.3.0/easybuild/python')
== 2021-09-22 11:19:49,523 filetools.py:2294 INFO /data/leuven/sys/x0076666/git/easybuild-easyconfigs/easybuild/easyconfigs/t/TensorFlow/TensorFlow-2.x_mnist-test.py copied to /dev/shm/x0076666/easybuild/TensorFlow/2.6.0/foss-2021a/TensorFlow-2.x_mnist-test.py
== 2021-09-22 11:19:49,523 run.py:233 INFO running cmd: python /dev/shm/x0076666/easybuild/TensorFlow/2.6.0/foss-2021a/TensorFlow-2.x_mnist-test.py
== 2021-09-22 11:19:55,459 build_log.py:169 ERROR EasyBuild crashed with an error (at easybuild/base/exceptions.py:124 in __init__): cmd "python /dev/shm/x0076666/easybuild/TensorFlow/2.6.0/foss-2021a/TensorFlow-2.x_mnist-test.py" exited with exit code 1 and output:
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
Traceback (most recent call last):
  File "/apps/leuven/skylake/2021a/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/urllib/request.py", line 1346, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/apps/leuven/skylake/2021a/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/http/client.py", line 1253, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/apps/leuven/skylake/2021a/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/http/client.py", line 1299, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/apps/leuven/skylake/2021a/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/http/client.py", line 1248, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/apps/leuven/skylake/2021a/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/http/client.py", line 1008, in _send_output
    self.send(msg)
  File "/apps/leuven/skylake/2021a/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/http/client.py", line 948, in send
    self.connect()
  File "/apps/leuven/skylake/2021a/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/http/client.py", line 1422, in connect
    self.sock = self._context.wrap_socket(self.sock,
  File "/apps/leuven/skylake/2021a/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/apps/leuven/skylake/2021a/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/ssl.py", line 1040, in _create
    self.do_handshake()
  File "/apps/leuven/skylake/2021a/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/apps/leuven/skylake/2021a/software/TensorFlow/2.6.0-foss-2021a/lib/python3.9/site-packages/keras/utils/data_utils.py", line 274, in get_file
    urlretrieve(origin, fpath, dl_progress)
  File "/apps/leuven/skylake/2021a/software/TensorFlow/2.6.0-foss-2021a/lib/python3.9/site-packages/keras/utils/data_utils.py", line 82, in urlretrieve
    response = urlopen(url, data)
  File "/apps/leuven/skylake/2021a/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/urllib/request.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "/apps/leuven/skylake/2021a/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/urllib/request.py", line 517, in open
    response = self._open(req, data)
  File "/apps/leuven/skylake/2021a/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/urllib/request.py", line 534, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/apps/leuven/skylake/2021a/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/urllib/request.py", line 494, in _call_chain
    result = func(*args)
  File "/apps/leuven/skylake/2021a/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/urllib/request.py", line 1389, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/apps/leuven/skylake/2021a/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/urllib/request.py", line 1349, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/dev/shm/x0076666/easybuild/TensorFlow/2.6.0/foss-2021a/TensorFlow-2.x_mnist-test.py", line 8, in <module>
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
  File "/apps/leuven/skylake/2021a/software/TensorFlow/2.6.0-foss-2021a/lib/python3.9/site-packages/keras/datasets/mnist.py", line 71, in load_data
    path = get_file(
  File "/apps/leuven/skylake/2021a/software/TensorFlow/2.6.0-foss-2021a/lib/python3.9/site-packages/keras/utils/data_utils.py", line 278, in get_file
    raise Exception(error_msg.format(origin, e.errno, e.reason))
Exception: URL fetch failure on https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz: None -- [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)

EDIT:
Simply pasting the above mnist.npz link in a browser downloads the file.

@akesandgren
Contributor

You need the ca-certificates package installed

@avapirev
Author

You mean system-wide? If so, that is out of my hands.

@avapirev
Author

ca-certificates is a static RPM providing the trusted CAs; it cannot crash, though it may fail to provide the right certificates.
Is it possible that we are missing the specification of the CA bundle in TensorFlow, e.g., /etc/ssl/certs/ca-bundle.crt?

@akesandgren
Contributor

The reason for the crash above is that it can't find the CA cert when doing a secure download. The only reason for that I can think of is that the RPM is not installed.

@avapirev
Author

The RPM is there and working on all nodes. Just confirmed with the sysadmin.

@boegel boegel added this to the 4.x milestone Sep 30, 2021
@boegel
Member

boegel commented Sep 30, 2021

@avapirev Let's try and narrow this down a bit...

Can you provide some more information about the system on which you're seeing this? Which OS, etc.?
Please share the output of eb --version, eb --show-system-info, and eb --show-config.

Does curl -OL https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz work on that system, without having any modules loaded?
If that doesn't work, you have a system-wide SSL problem to fix.

If that works, please try running this Python code, after loading the Python/3.9.5-GCCcore-10.3.0 module:

from urllib.request import urlopen

res = urlopen('https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz')
print(res.status)

Output should be 200 (which indicates success in accessing the URL).
If not, there may be something wrong with that specific Python installation (missing SSL support, for example).
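
If the status check fails with a certificate error, it can help to see where this Python's OpenSSL looks for CA certificates. A minimal sketch using only the standard library follows; the helper name `describe_ca_paths` is illustrative and not part of EasyBuild:

```python
import os
import ssl


def describe_ca_paths():
    """Return where the ssl module's OpenSSL looks for CA certificates by default."""
    paths = ssl.get_default_verify_paths()
    return {
        "openssl_version": ssl.OPENSSL_VERSION,
        # Compiled-in defaults of the OpenSSL library that Python links against.
        "openssl_cafile": paths.openssl_cafile,
        "openssl_capath": paths.openssl_capath,
        # Whether those default locations actually exist on this system.
        "cafile_exists": os.path.exists(paths.openssl_cafile or ""),
        "capath_exists": os.path.isdir(paths.openssl_capath or ""),
    }


if __name__ == "__main__":
    for key, value in describe_ca_paths().items():
        print(f"{key}: {value}")
```

If neither default location exists, certificate verification would be expected to fail even though curl (which uses the system OpenSSL) succeeds.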

@boegel boegel changed the title from "TensorFlow-2.6.0-foss-2021a.eb fails to build" to "sanity check for TensorFlow-2.6.0-foss-2021a.eb fails (download problem when running TensorFlow-2.x_mnist-test.py test case)" Sep 30, 2021
@avapirev
Author

avapirev commented Sep 30, 2021

$ eb --version
This is EasyBuild 4.4.2 (framework: 4.4.2, easyblocks: 4.4.2) on host login1.
$ eb --show-system-info
System information (login1):

* OS:
  -> name: centos linux
  -> type: Linux
  -> version: 7.6.1810
  -> platform name: x86_64-unknown-linux

* CPU:
  -> vendor: Intel
  -> architecture: x86_64
  -> family: Intel
  -> arch name: UNKNOWN (archspec is not installed?)
  -> model: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
  -> speed: 2401.0
  -> cores: 56
  -> features: 3dnowprefetch,abm,acpi,adx,aes,aperfmperf,apic,arat,arch_perfmon,avx,avx2,bmi1,bmi2,bts,cat_l3,cdp_l3,clflush,cmov,constant_tsc,cqm,cqm_llc,cqm_mbm_local,cqm_mbm_total,cqm_occup_llc,cx16,cx8,dca,de,ds_cpl,dtes64,dtherm,dts,eagerfpu,epb,ept,erms,est,f16c,flexpriority,fma,fpu,fsgsbase,fxsr,hle,ht,ida,intel_ppin,intel_pt,invpcid,lahf_lm,lm,mca,mce,mmx,monitor,movbe,msr,mtrr,nonstop_tsc,nopl,nx,pae,pat,pbe,pcid,pclmulqdq,pdcm,pdpe1gb,pebs,pge,pln,pni,popcnt,pse,pse36,pts,rdrand,rdseed,rdt_a,rdtscp,rep_good,rtm,sdbg,sep,smep,smx,ss,sse,sse2,sse4_1,sse4_2,ssse3,syscall,tm,tm2,tpr_shadow,tsc,tsc_adjust,vme,vmx,vnmi,vpid,x2apic,xsave,xsaveopt,xtopology,xtpr

* software:
  -> glibc version: 2.17
  -> Python binary: /usr/bin/python2
  -> Python version: 2.7.5
$ eb --show-config
#
# Current EasyBuild configuration
# (C: command line argument, D: default value, E: environment variable, F: configuration file)
#
buildpath      (F) = /dev/shm/x0076666/easybuild
configfiles    (C) = /data/leuven/sys/x0076666/easybuild/easybuild-breniac-2021a-broadwell.cfg
containerpath  (F) = /apps/leuven/broadwell/2021a/containers
installpath    (F) = /apps/leuven/broadwell/2021a
module-syntax  (F) = Tcl
modules-tool   (F) = EnvironmentModulesC
packagepath    (F) = /apps/leuven/broadwell/2021a/packages
prefix         (F) = /apps/leuven/broadwell/2021a/
repositorypath (F) = /apps/leuven/broadwell/2021a/ebfiles_repo
robot          (F) = /data/leuven/sys/x0076666/git/easybuild-easyconfigs/easybuild/easyconfigs, /data/leuven/sys/x0076666/.local/software/EasyBuild/4.4.2/easybuild/easyconfigs
robot-paths    (F) = /data/leuven/sys/x0076666/git/easybuild-easyconfigs/easybuild/easyconfigs, /data/leuven/sys/x0076666/.local/software/EasyBuild/4.4.2/easybuild/easyconfigs
sourcepath     (F) = /apps/leuven/sources/
umask          (F) = 002

This works:

curl -OL https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz 
This command fails with the above-mentioned certificate error
>>> res = urlopen('https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz')
However, the certificate package is there (in the python install)
>>> import certifi
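
Note that having certifi importable does not by itself change what urlopen trusts: the standard library builds its default context from OpenSSL's default locations and only uses another bundle when one is passed explicitly. A minimal diagnostic sketch (the helper names `make_context` and `fetch_status` are hypothetical), meant for narrowing down the failure rather than as a fix:

```python
import ssl
from urllib.request import urlopen


def make_context(cafile=None):
    """Build a verifying SSL context, optionally trusting an explicit CA bundle."""
    # With cafile=None this loads certificates from OpenSSL's default
    # locations, the same defaults urlopen() relies on.
    return ssl.create_default_context(cafile=cafile)


def fetch_status(url, cafile=None):
    """Return the HTTP status for url, verifying against cafile if given."""
    with urlopen(url, context=make_context(cafile)) as res:
        return res.status


if __name__ == "__main__":
    try:
        import certifi  # third-party; only used when available
        bundle = certifi.where()
    except ImportError:
        bundle = None
    print("CA bundle that could be passed explicitly:", bundle)
```

If `fetch_status(url, certifi.where())` succeeds while `fetch_status(url)` fails, that would suggest the Python installation has working SSL support but its OpenSSL cannot find the system CA certificates.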

@casparvl
Contributor

casparvl commented Oct 6, 2021

+1, I have the same issue. My system info:

[casparl@tcn2 ~]$ eb --version
This is EasyBuild 4.4.2 (framework: 4.4.2, easyblocks: 4.4.2) on host tcn2.
[casparl@tcn2 ~]$ eb --show-system-info
System information (tcn2):

* OS:
  -> name: centos linux
  -> type: Linux
  -> version: 8.4.2105
  -> platform name: x86_64-unknown-linux

* CPU:
  -> vendor: AMD
  -> architecture: x86_64
  -> family: AMD
  -> arch name: UNKNOWN (archspec is not installed?)
  -> model: AMD EPYC 7H12 64-Core Processor
  -> speed: 2600.0
  -> cores: 128
  -> features: 3dnowprefetch,abm,adx,aes,amd_ppin,aperfmperf,apic,arat,avic,avx,avx2,bmi1,bmi2,bpext,cat_l3,cdp_l3,clflush,clflushopt,clwb,clzero,cmov,cmp_legacy,constant_tsc,cpb,cpuid,cqm,cqm_llc,cqm_mbm_local,cqm_mbm_total,cqm_occup_llc,cr8_legacy,cx16,cx8,de,decodeassists,extapic,extd_apicid,f16c,flushbyasid,fma,fpu,fsgsbase,fxsr,fxsr_opt,ht,hw_pstate,ibpb,ibrs,ibs,irperf,lahf_lm,lbrv,lm,mba,mca,mce,misalignsse,mmx,mmxext,monitor,movbe,msr,mtrr,mwaitx,nonstop_tsc,nopl,npt,nrip_save,nx,osvw,overflow_recov,pae,pat,pausefilter,pclmulqdq,pdpe1gb,perfctr_core,perfctr_llc,perfctr_nb,pfthreshold,pge,pni,popcnt,pse,pse36,rdpid,rdrand,rdseed,rdt_a,rdtscp,rep_good,sep,sha_ni,skinit,smap,smca,smep,ssbd,sse,sse2,sse4_1,sse4_2,sse4a,ssse3,stibp,succor,svm,svm_lock,syscall,tce,topoext,tsc,tsc_scale,umip,v_vmsave_vmload,vgif,vmcb_clean,vme,vmmcall,wbnoinvd,wdt,xgetbv1,xsave,xsavec,xsaveerptr,xsaveopt,xsaves

* software:
  -> glibc version: 2.28
  -> Python binary: /usr/bin/python3
  -> Python version: 3.6.8
[casparl@tcn2 ~]$ eb --show-config
#
# Current EasyBuild configuration
# (C: command line argument, D: default value, E: environment variable, F: configuration file)
#
buildpath                 (F) = /gpfs/scratch1/casparl
configfiles               (E) = /gpfs/admin/hpc/sw/arch/NOARCH/Centos8/2021/software/eb/4.4.2/etc/config.cfg
containerpath             (D) = /home/casparl/.local/easybuild/containers
cuda-compute-capabilities (F) = 8.0
download-timeout          (F) = 60.0
experimental              (F) = True
github-org                (F) = sara-nl
installpath               (D) = /home/casparl/.local/easybuild
installpath-modules       (F) = /sw/noarch/Centos8/2021/modulefiles
installpath-software      (F) = /sw/noarch/Centos8/2021/software
minimal-toolchains        (F) = True
optarch                   (F) = {'Intel': 'O2 -march=core-avx2', 'GCC': 'O2 -mavx2'}
repositorypath            (D) = /home/casparl/.local/easybuild/ebfiles_repo
robot-paths               (F) = /sw/eb/easyconfigs-surf, /gpfs/admin/hpc/sw/arch/NOARCH/Centos8/2021/software/EasyBuild/4.4.2/easybuild/easyconfigs
rpath                     (F) = True
set-gid-bit               (F) = True
sourcepath                (D) = /home/casparl/.local/easybuild/sources
tmpdir                    (F) = /gpfs/scratch1/casparl
umask                     (F) = 022
use-existing-modules      (F) = True

This remark from @boegel

If not, there may be something wrong with that specific Python installation (missing SSL support, for example).

Triggered something in me: at first, we didn't have OpenSSL's development headers installed with our system OpenSSL package. Thus, EasyBuild would build its own OpenSSL. I think this was the OpenSSL our Python was built against. Then, I ran into the issue that TensorFlow threw this error during the build:

In file included from external/com_github_grpc_grpc/src/core/tsi/ssl/session_cache/ssl_session_boringssl.cc:21:
external/com_github_grpc_grpc/src/core/tsi/ssl/session_cache/ssl_session.h:29:10: fatal error: openssl/ssl.h: No such file or directory
   29 | #include <openssl/ssl.h>
      |          ^~~~~~~~~~~~~~~
compilation terminated.
Target //tensorflow/tools/pip_package:build_pip_package failed to build
ERROR: /tmp/jenkins/build/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/TensorFlow/tensorflow-2.6.0/tensorflow/lite/toco/python/BUILD:89:10 C++ compilation of rule '@com_github_grpc_grpc//:tsi' failed (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command

The build logs showed that OpenSSL was in the LD_LIBRARY_PATH, but not in the CPATH. I figured no one has hit this before because probably, most people use their OS OpenSSL. Thus, I asked my sysadmins to install openssl-devel, et voila, the build of TensorFlow now succeeded - only to crash during the sanity check with the aforementioned issue.

Maybe it's because my Python was built against EasyBuild's OpenSSL, and now that we also have a system one, that messes things up? @avapirev any chance that you went through something similar? If so, that would be a good indication that this might be the cause.

I'd love to try and build a completely new stack now that we have the OS OpenSSL headers in place, but since this is on our new system, that's a bit of a challenge (filesystems are not really stable yet). If I do at some point manage to rebuild the full stack, I'll let you know if it helped.

@casparvl
Contributor

casparvl commented Oct 6, 2021

First, loading Python/3.9.5-GCCcore-10.3.0 and trying @boegel 's minimal example:

>>> from urllib.request import urlopen
>>>
>>> res = urlopen('https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz')
Traceback (most recent call last):
  File "/tmp/sw_stack/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/urllib/request.py", line 1346, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/tmp/sw_stack/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/http/client.py", line 1253, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/tmp/sw_stack/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/http/client.py", line 1299, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/tmp/sw_stack/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/http/client.py", line 1248, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/tmp/sw_stack/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/http/client.py", line 1008, in _send_output
    self.send(msg)
  File "/tmp/sw_stack/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/http/client.py", line 948, in send
    self.connect()
  File "/tmp/sw_stack/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/http/client.py", line 1422, in connect
    self.sock = self._context.wrap_socket(self.sock,
  File "/tmp/sw_stack/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/tmp/sw_stack/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/ssl.py", line 1040, in _create
    self.do_handshake()
  File "/tmp/sw_stack/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)

Then, I reinstalled OpenSSL-1.1.eb (which will now wrap the system OpenSSL instead of installing the EasyBuild one); after that, loading Python/3.9.5-GCCcore-10.3.0 made this work for me:

>>> from urllib.request import urlopen
>>>
>>> res = urlopen('https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz')
>>> print(res.status)
200

So, I don't think this issue is TensorFlow related at all, it's more related to the EasyBuild installation of OpenSSL-1.1.

I haven't tried reinstalling TF yet, but my bet is that this is the solution:

  • Install OpenSSL (including devel headers) at the system level
  • Reinstall OpenSSL-1.1.eb, and check that it now wraps your system OpenSSL (i.e. .../lib64 should contain softlinks to your system OpenSSL)
  • Retry the TensorFlow installation
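
The second step can be checked with a few lines of Python. This is only a sketch: the helper name `symlinked_entries` is made up for illustration, and it assumes that the OpenSSL module sets `$EBROOTOPENSSL` and that the libraries live in its `lib64` subdirectory.

```python
import os


def symlinked_entries(directory):
    """Map each symlink in directory to the path it ultimately points to."""
    links = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.islink(path):
            links[name] = os.path.realpath(path)
    return links


if __name__ == "__main__":
    # Assumption: EBROOTOPENSSL is set after `module load OpenSSL/1.1`.
    root = os.environ.get("EBROOTOPENSSL")
    if root is None:
        print("EBROOTOPENSSL is not set; load the OpenSSL module first")
    else:
        lib_dir = os.path.join(root, "lib64")
        if not os.path.isdir(lib_dir):
            print(f"{lib_dir} not found")
        else:
            for name, target in symlinked_entries(lib_dir).items():
                print(f"{name} -> {target}")
```

If the listed targets point outside the EasyBuild installation prefix, the module is wrapping the system OpenSSL; an empty listing would suggest a from-source installation.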

@casparvl
Contributor

casparvl commented Oct 6, 2021

Small update: I can confirm that the three steps I described above fixed the issue for me.

@boegel
Member

boegel commented Oct 6, 2021

This works:

curl -OL https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz 
This command fails with the above-mentioned certificate error
>>> res = urlopen('https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz')
However, the certificate package is there (in the python install)
>>> import certifi

@avapirev So, it's not a system-wide error, but an issue specific to the Python/3.9.5-GCCcore-10.3.0 installation you are using. Does python -c 'import _ssl' work correctly for that Python installation?
Can you share the EasyBuild log file for that installation?

@avapirev
Author

avapirev commented Oct 7, 2021

FYI: This works fine for me with no need to recompile OpenSSL

from urllib.request import urlopen

@boegel
Member

boegel commented Oct 13, 2021

@avapirev Does just the import work, or also downloading a file over HTTPS via urlopen?

@avapirev
Author

Maybe I forgot to reply. No, the _ssl import does not work:

python
Python 3.9.5 (default, Jun 30 2021, 19:01:23)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

python -c 'import _ssl'
File "", line 1
python -c 'import _ssl'
^
SyntaxError: invalid syntax
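
The SyntaxError above comes from typing the shell command at the Python prompt rather than in the shell, so it says nothing about `_ssl` itself. An equivalent check that can be run from inside the interpreter is sketched below (the helper name `ssl_support` is illustrative):

```python
def ssl_support():
    """Return the OpenSSL version string if _ssl can be imported, else None."""
    try:
        import _ssl  # noqa: F401  (the C extension behind the ssl module)
        import ssl
    except ImportError:
        # Python was built without SSL support, or the library failed to load.
        return None
    return ssl.OPENSSL_VERSION


if __name__ == "__main__":
    version = ssl_support()
    if version is None:
        print("_ssl could not be imported")
    else:
        print(f"_ssl available, linked against: {version}")
```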

@seb45tian
Contributor

seb45tian commented Jan 13, 2022

Also ran into this and can confirm that @casparvl's steps work. We are still on CentOS 7.9 and I had to install openssl11 and openssl11-dev openssl-devel first, otherwise it would not wrap the system libraries. However, shouldn't this also work when using the EB-built version of OpenSSL-1.1?

@verdurin
Member

I've just bumped against this, and second the question from @seb45tian

@boegel
Member

boegel commented Jun 4, 2022

I'm seeing this same issue now in the CentOS 7.9 container I'm starting to use for regression tests (since all our systems are now RHEL8)...

So I'll try to take a look at what's going on here.

@boegel
Member

boegel commented Jun 4, 2022

Problem reproduced in CentOS 7.9 container:

Singularity> python -c "from urllib.request import urlopen; res = urlopen('https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz'); print(res.status)"
...
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)>

Issue is not the ca-certificates package, since that's installed and sufficiently recent (see also https://aws.amazon.com/premiumsupport/knowledge-center/ec2-expired-certificate/ + https://www.openssl.org/blog/blog/2021/09/13/LetsEncryptRootCertExpire):

Singularity> rpm -q ca-certificates
ca-certificates-2021.2.50-72.el7_9.noarch

I only have OpenSSL 1.0.2 installed as OS package, OpenSSL 1.1 is not there (currently) in the container:

Singularity> rpm -q openssl
openssl-1.0.2k-25.el7_9.x86_64
Singularity> rpm -q openssl11
package openssl11 is not installed

The fix in easybuilders/easybuild-easyblocks#2575 looks like it could be related (TensorFlow + OpenSSL), so I'll try that first...
edit: Since the problem can be reproduced using only Python, that fix for the TensorFlow easyblock obviously won't fix this problem

@boegel
Member

boegel commented Jun 4, 2022

OK, spent quite a bit of time debugging this today, and it seems like I've got it figured out...

It boils down to a bug in the from-source OpenSSL installation that is provided by OpenSSL/1.1 if the openssl11 package is not installed.
The changes in easybuilders/easybuild-easyblocks#2683 to make a from-source OpenSSL installation actually pick up on the system certificates are not complete.

Just symlinking $EBROOTOPENSSL/ssl/certs to OPENSSLDIR/certs (where OPENSSLDIR is determined by the output of openssl version -d, with /etc/ssl/certs as a fallback) is not sufficient.
We should also be symlinking $EBROOTOPENSSL/ssl/cert.pem to OPENSSLDIR/cert.pem (to /etc/pki/tls/cert.pem for example in CentOS 7.9).

The cert.pem file is owned by the ca-certificates package:

Singularity> rpm -qf /etc/pki/tls/cert.pem
ca-certificates-2021.2.50-72.el7_9.noarch

Manually symlinking cert.pem in $EBROOTOPENSSL/ssl for OpenSSL/1.1 fixes the problem for me.

First, without this change:

module load OpenSSL/1.1
Singularity> echo | openssl s_client -connect github.com:443 -verify 9 2>&1 | grep 'Verify return code'
Verify return code: 20 (unable to get local issuer certificate)

(stopping the openssl command requires Ctrl-C)

Then, adding the missing symlink:

cd $EBROOTOPENSSL/ssl
ln -s /etc/pki/tls/cert.pem

recheck:

Singularity> echo | openssl s_client -connect github.com:443 -verify 9 2>&1 | grep 'Verify return code'
Verify return code: 0 (ok)

and checking again with Python:

Singularity> module load Python/3.9.5-GCCcore-10.3.0
Singularity> python -c "from urllib.request import urlopen; res = urlopen('https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz'); print(res.status)"
200

So, we should:

  1. Make the OpenSSL easyblock also add the symlink for cert.pem (needs to be checked if this is needed on all OSs);
  2. Add a sanity check command to the EB_OpenSSL (and also EB_OpenSSL_wrapper?) easyblock to catch this.
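
As a rough illustration of the second point, such a check could look for a zero verify return code in the `openssl s_client` output shown above. The sketch below is not the easyblock's implementation; `parse_verify_return_code` is a hypothetical helper that only parses the output text.

```python
import re

# Matches lines such as: Verify return code: 0 (ok)
VERIFY_LINE = re.compile(r"Verify return code:\s*(\d+)\s*\((.*?)\)")


def parse_verify_return_code(output):
    """Extract (code, message) from `openssl s_client` output, or None if absent."""
    match = VERIFY_LINE.search(output)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


if __name__ == "__main__":
    failing = "Verify return code: 20 (unable to get local issuer certificate)"
    passing = "Verify return code: 0 (ok)"
    print(parse_verify_return_code(failing))
    print(parse_verify_return_code(passing))
```

A check built on this would treat any non-zero code, or a missing line, as a failure.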

cc @lexming

edit:

  • On Ubuntu 20.04, there is no cert.pem, and the verification command works fine for both OpenSSL/1.0 (built from source) and OpenSSL/1.1 (symlinked to system OpenSSL), so the symlinking of cert.pem should only be done if it exists on the system in OPENSSLDIR
  • On OpenSUSE 15.3 there is no cert.pem in OPENSSLDIR (/etc/ssl). There is a ca-bundle.pem, but the verification command works fine even if no symlinking is done in the OpenSSL install dir.
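
The conditional symlinking described in these notes could be sketched as follows. This is an illustration under stated assumptions, not the actual change to the OpenSSL easyblock: the function names are hypothetical, the `OPENSSLDIR: "..."` output format of `openssl version -d` is assumed, and the `/etc/ssl` fallback for OPENSSLDIR is an assumption as well.

```python
import os
import re


def parse_openssldir(version_output, fallback="/etc/ssl"):
    """Extract OPENSSLDIR from the output of `openssl version -d`."""
    # Assumed output format: OPENSSLDIR: "/etc/pki/tls"
    match = re.search(r'OPENSSLDIR:\s*"([^"]+)"', version_output)
    return match.group(1) if match else fallback


def link_cert_pem(openssldir, install_ssl_dir):
    """Symlink cert.pem into install_ssl_dir, but only if the system provides it."""
    source = os.path.join(openssldir, "cert.pem")
    target = os.path.join(install_ssl_dir, "cert.pem")
    if not os.path.exists(source):
        return False  # no cert.pem on this system, so nothing to link
    if os.path.lexists(target):
        return False  # leave an existing file or link untouched
    os.symlink(source, target)
    return True
```

Here `install_ssl_dir` would correspond to `$EBROOTOPENSSL/ssl`, and the return value indicates whether a new symlink was created.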

@boegel
Member

boegel commented Jun 8, 2022

The problem with OpenSSL is fixed with the updated easyblock in https://github.com/easybuilders/easybuild-easyblocks, which is included in EasyBuild v4.5.5.

So, regardless of whether OpenSSL is wrapping the system OpenSSL, or whether it's a from-source installation, reinstalling OpenSSL with EasyBuild v4.5.5 should fix the problem that's being reported here, so I'll close this issue.

@boegel boegel closed this as completed Jun 8, 2022
@boegel boegel modified the milestones: 4.x, 4.5.5 Jun 8, 2022