
Very frequent segfaults with the new netCDF4=1.6.1 #1192

Open

valeriupredoi opened this issue Sep 15, 2022 · 32 comments
@valeriupredoi

valeriupredoi commented Sep 15, 2022

Heads up guys, we are seeing some very frequent segfaults in our CI now that the new, hours-old netCDF4=1.6.1 is in our environment. It is most probably down to that package: HDF5 has been at 1.12.2 for more than a month now, and with netCDF4=1.6.0 everything works fine (all other packages held at the same version and build hash). Apologies if this turns out to be due to a different package, but better safe than sorry in terms of a forewarning. Cheers muchly 🍺

@ocefpaf
Collaborator

ocefpaf commented Sep 15, 2022

Looks like you are using conda-forge's netcdf4. Maybe open an issue at https://github.com/conda-forge/netcdf4-feedstock instead.

PS: could you also test the wheels just to be sure they are OK?

@valeriupredoi
Author

@ocefpaf good call, mate! Will do so, cheers 🍺

@neutrinoceros
Contributor

We're also having issues on yt with the windows wheels for version 1.6.1. Namely, h5py is raising a warning at import. See yt-project/yt#4128

@trexfeathers

Same with us - see SciTools/iris#4968. Sometimes manifests as segfaults, sometimes as crashed GHA workers (maybe segfault underneath).

@ocefpaf I've confirmed the same problems appear when installing from PyPI OR from conda-forge.

@valeriupredoi
Author

> @ocefpaf I've confirmed the same problems appear when installing from PyPI OR from conda-forge.

Many thanks, I was about to test the PyPI version - cheers for testing, that saves me some lunch time 😁

@ocefpaf
Collaborator

ocefpaf commented Sep 16, 2022

@trexfeathers what platforms are failing when you tested the PyPI wheels? I'm particularly interested in the Windows wheels for 1.6.1 b/c those are built in a different way now.

@trexfeathers

> @trexfeathers what platforms are failing when you tested the PyPI wheels? I'm particularly interested in the Windows wheels for 1.6.1 b/c those are built in a different way now.

GHA's ubuntu-latest. Despite several attempts, we have yet to get Iris' test suite working on Windows.

@neutrinoceros
Contributor

@ocefpaf to clarify, on yt we're testing with PyPI wheels for all three major platforms, and we're only seeing issues on windows.

@valeriupredoi
Author

Ah, I realized I've not clarified this myself: we see segfaults from a conda-forge install on both ubuntu-latest and OSX-latest on GHA, and on ubuntu on CircleCI (also off conda-forge). No Windows testing for us, since we've not been able to get a working install of our packages there either, snif but not snif 😁

@valeriupredoi
Author

valeriupredoi commented Sep 16, 2022

So here's a pretty interesting case study that may help fix this current issue - and a rather rare recurring issue @agstephens and myself have noticed in the past, with older (and stable, bullet-proof) versions of netCDF4:

  • our tests fail with a multitude of HDF-related segfaults, complaints about closing datasets, etc.;
  • I tried recreating the netCDF sample data that the tests fail on, but I couldn't, since the same segfaults crept up while trying to create it, so...
  • I simply moved the files out of, and back into, the location where they should be, and...
  • no more segfaults (yes, I ran quite a few iterations of the test, so I am 100% sure they don't fail).

My colleague Ag noticed the same behaviour, way back, on very few occasions - an HDF segfault on a certain file would automagically disappear if we moved the file out of and back into its location. We blamed it on the filesystem back then but, thinking in retrospect, could it be the same issue here?

@ocefpaf
Collaborator

ocefpaf commented Sep 16, 2022

> and we're only seeing issues on windows.

@jswhit it may be prudent to yank those wheels until we figure out what is going on. They pass the tests in the repo but are not holding up well in the "production test" :-/

However, the other platforms are failing in other CIs so this is quite confusing and we'll need the reports here to help us sort this out.


Edit: @neutrinoceros your report upstream is about h5py and not netcdf4, right? xref: yt-project/yt#4128

@neutrinoceros
Contributor

We're only seeing a warning and yes, it's triggered from h5py. I'm assuming it's the same underlying issue, but that's a wild guess.

@ocefpaf
Collaborator

ocefpaf commented Sep 16, 2022

> We're only seeing a warning and yes, it's triggered from h5py. I'm assuming it's the same underlying issue, but that's a wild guess.

Most likely not. Folks here are experiencing segfaults with the latest netcdf4-python. The h5py warning in your CI appears just because one version of HDF5 was used at build time but another one is used at run time. In my experience that is OK in 99.99% of cases.
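For illustration, the build-versus-runtime version check behind that kind of warning can be sketched with plain string parsing. This is a hypothetical stand-in, not h5py's actual implementation, and the version strings are made up:

```python
# Sketch of an HDF5 build/runtime version-mismatch check, similar in
# spirit to the warning h5py emits at import time. All names and
# version strings here are illustrative.

def parse_version(v):
    """Turn a dotted version string like '1.12.2' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

def check_hdf5_mismatch(built, runtime):
    """Return a warning message if built/runtime HDF5 versions differ, else None."""
    if parse_version(built) != parse_version(runtime):
        return ("running against HDF5 %s when built against %s; "
                "this may cause problems" % (runtime, built))
    return None

print(check_hdf5_mismatch("1.12.1", "1.12.2"))  # mismatch: warning text
print(check_hdf5_mismatch("1.12.2", "1.12.2"))  # match: None
```

As ocefpaf notes, a mismatch like this is usually benign as long as the two versions are ABI-compatible; the segfaults in this thread have a different cause.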

@neutrinoceros
Contributor

Should I file another issue?

@ocefpaf
Collaborator

ocefpaf commented Sep 16, 2022

> Should I file another issue?

Probably not. It'll be closed b/c it is a known warning that is mostly harmless.

@jswhit
Collaborator

jswhit commented Sep 17, 2022

It's not clear to me whether this is an issue with all the wheels for 1.6.1, or just the windows wheels? It's hard to see why the linux and macosx wheels would be a problem, since they are built exactly the same way as they were for 1.6.0. The most significant code change in netcdf4-python in 1.6.1 is PR #1181, but I don't see how this could cause segfaults.

vinisalazar added a commit to vinisalazar/erddapy that referenced this issue Sep 19, 2022
  - Pin older version of netCDF4 to avoid Unidata/netcdf4-python#1192
  - See ioos#268 for discussion
@Zeitsperre

I want to share a workaround I've been using to deal with this netcdf4-python issue in my projects. After installing all other dependencies, I reinstall netcdf4-python from source with the following (this has solved my issues):

```shell
python -m pip install --upgrade --force-reinstall --no-deps --no-cache-dir netcdf4 --no-binary netcdf4
```

Echoing @jswhit, I mentioned in another issue that I don't think the problem is the code but rather the wheel-building process, since installing from source works perfectly fine.

In any case, this is a really mysterious problem!

@jswhit
Collaborator

jswhit commented Sep 20, 2022

@Zeitsperre are you having a problem with windows wheels only, or also the linux and macosx wheels?

@valeriupredoi
Author

@jswhit would #1192 (comment) give you some sort of a clue what might trigger those intermittent but rather frequent SegFaults? It's a bit black magic to me at the moment 😁

@valeriupredoi
Author

valeriupredoi commented Sep 20, 2022

hey guys, it appears @Zeitsperre is correct and this whole segfaulting issue is caused by some installation problem: I went the conda-forge way and did a couple of black-box tests, see below

  • I installed netcdf4=1.6.1 as part of our environment, as per usual and as in the CI case where we noticed the segfaults, and ran the test that has a tendency to segfault, with 0 and 2 processes: in both cases the test failed (either S: segfault or H: HDF error, see below) 4 out of 12 times, so 8/24 in total; note that this was done on a stable single machine with no other load or shared access;
  • then I downgraded to 1.6.0 and, as expected, nothing poops the bed; note that apart from netcdf4 no other lib changed version or build hash;
  • then I re-upgraded (again, just netcdf4 changed, no other dependency) and ran the same experiment, this time noticing a visibly reduced frequency of failures at 4/24.

Could it be that the conda compilers are not preserving the right flags or compilation order for you, specifically for 1.6.1? I know (extreme) cases where people need to compile numpy themselves since the conda-forge supplied version gives them headaches due to numerical precision deltas from version to version, but that's normal(ish). Anyways, here are my test results:

conda-forge install via mamba

netcdf4=1.6.1

  • pytest -n 0 result: SSS000000H00 4/12 fails
  • pytest -n 2 result: SS00H000H000 4/12 fails

downgrade to 1.6.0:

```diff
- netcdf4    1.6.1  nompi_py310h55e1e36_100  conda-forge
+ netcdf4    1.6.0  nompi_py310h55e1e36_102  conda-forge/linux-64
```

(nothing else changed in the conda env)

netcdf4=1.6.0

  • pytest -n 0 result: 0000... no fails
  • pytest -n 2 result: 0000... no fails

reupgrade:

```diff
- netcdf4    1.6.0  nompi_py310h55e1e36_102  conda-forge
+ netcdf4    1.6.1  nompi_py310h55e1e36_100  conda-forge/linux-64
```

netcdf4=1.6.1

  • pytest -n 0 result: 00000000H000 1/12 fails
  • pytest -n 2 result: S0000000SS00 3/12 fails

Legend

  • 0: pass OK
  • S: segfault
  • H -> HDF error:
tests/sample_data/multimodel_statistics/test_multimodel.py:237: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/sample_data/multimodel_statistics/test_multimodel.py:197: in multimodel_regression_test
    result = multimodel_test(cubes, statistic=statistic, span=span)
tests/sample_data/multimodel_statistics/test_multimodel.py:178: in multimodel_test
    result = multi_model_statistics(products=cubes,
esmvalcore/preprocessor/_multimodel.py:493: in multi_model_statistics
    return _multicube_statistics(
esmvalcore/preprocessor/_multimodel.py:388: in _multicube_statistics
    result_cube = _compute_eager(aligned_cubes,
esmvalcore/preprocessor/_multimodel.py:319: in _compute_eager
    _ = [cube.data for cube in cubes]  # make sure the cubes' data are realized
esmvalcore/preprocessor/_multimodel.py:319: in <listcomp>
    _ = [cube.data for cube in cubes]  # make sure the cubes' data are realized
../miniconda3/envs/flake8/lib/python3.10/site-packages/iris/cube.py:2315: in data
    return self._data_manager.data
../miniconda3/envs/flake8/lib/python3.10/site-packages/iris/_data_manager.py:206: in data
    result = as_concrete_data(self._lazy_array)
../miniconda3/envs/flake8/lib/python3.10/site-packages/iris/_lazy_data.py:252: in as_concrete_data
    (data,) = _co_realise_lazy_arrays([data])
../miniconda3/envs/flake8/lib/python3.10/site-packages/iris/_lazy_data.py:215: in _co_realise_lazy_arrays
    computed_arrays = da.compute(*arrays)
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/base.py:600: in compute
    results = schedule(dsk, keys, **kwargs)
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/threaded.py:89: in get
    results = get_async(
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/local.py:511: in get_async
    raise_exception(exc, tb)
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/local.py:319: in reraise
    raise exc
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/local.py:224: in execute_task
    result = _execute_task(task, data)
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/core.py:119: in <genexpr>
    return func(*(_execute_task(a, cache) for a in args))
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/core.py:119: in <genexpr>
    return func(*(_execute_task(a, cache) for a in args))
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/optimization.py:990: in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/core.py:149: in get
    result = _execute_task(task, cache)
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/utils.py:71: in apply
    return func(*args, **kwargs)
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/array/core.py:122: in getter
    c = a[b]
../miniconda3/envs/flake8/lib/python3.10/site-packages/iris/fileformats/netcdf.py:418: in __getitem__
    dataset.close()
src/netCDF4/_netCDF4.pyx:2624: in netCDF4._netCDF4.Dataset.close
    ???
src/netCDF4/_netCDF4.pyx:2587: in netCDF4._netCDF4.Dataset._close
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   RuntimeError: NetCDF: HDF error

src/netCDF4/_netCDF4.pyx:2028: RuntimeError
------------------------------------------------------------------------- Captured stderr call -------------------------------------------------------------------------
HDF5-DIAG: Error detected in HDF5 (1.12.2) MPI-process 0:
  #000: H5D.c line 320 in H5Dclose(): not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
============================================

@Zeitsperre

> @Zeitsperre are you having a problem with windows wheels only, or also the linux and macosx wheels?

I'm only testing on Linux systems, so nothing for me to report on Windows or macOS.

@jswhit
Collaborator

jswhit commented Sep 20, 2022

> @Zeitsperre are you having a problem with windows wheels only, or also the linux and macosx wheels?
>
> I'm only testing on Linux systems, so nothing for me to report on Windows or macOS.

Are you using wheels from PyPI or conda to install?

@ocefpaf
Collaborator

ocefpaf commented Sep 20, 2022

> Are you using wheels from PyPI or conda to install?

Folks, please, everyone that is using the package from conda-forge post your issues and comments in conda-forge/netcdf4-feedstock#141 and not here. Let's help out with the triage so we can solve this!

@valeriupredoi
Author

cheers @ocefpaf - I'll link my comment above with the test results to the feedstock issue, good point! I am still not 100% sure whether it's conda, the PyPI installation, or the code itself that's causing this, which is why I was primarily posting guff here so the experts may be able to get some clues 🍺

@Zeitsperre

> @Zeitsperre are you having a problem with windows wheels only, or also the linux and macosx wheels?
>
> I'm only testing on Linux systems, so nothing for me to report on Windows or macOS.
>
> Are you using wheels from PyPI or conda to install?

The PyPI wheels have not been working for me, but the conda binaries on Linux have been fine.

@jswhit
Collaborator

jswhit commented Sep 21, 2022

@valeriupredoi reported at conda-forge/netcdf4-feedstock#141 that his segfaults were all related to the use of file caching, and that if the file is read directly from disk the segfaults go away. Are others experiencing segfaults also using some sort of caching of netCDF4.Dataset objects?
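The kind of caching being asked about typically looks like the sketch below: a shared cache of open dataset handles reused across calls. Everything here is hypothetical (the `FakeDataset` class stands in for `netCDF4.Dataset` so the sketch runs without netCDF4 installed), but it shows why such caching is fragile:

```python
from functools import lru_cache

class FakeDataset:
    """Stand-in for netCDF4.Dataset so this sketch runs without netCDF4."""
    def __init__(self, path):
        self.path = path
        self.closed = False
    def close(self):
        self.closed = True

@lru_cache(maxsize=4)
def open_cached(path):
    """Hand out an already-open handle instead of reopening the file."""
    return FakeDataset(path)

a = open_cached("sample.nc")
b = open_cached("sample.nc")
assert a is b  # the same cached handle is handed out twice

# Danger: if one consumer closes the shared handle, the cache keeps
# serving the dead handle to everyone else. With a real netCDF4.Dataset,
# operating on such a handle is exactly the kind of use-after-close that
# can surface as "RuntimeError: NetCDF: HDF error" or a segfault.
a.close()
assert open_cached("sample.nc").closed  # cache still serves the closed handle
```

Reading the file directly from disk on each access (no shared handle) sidesteps the problem, which would be consistent with the report above.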

@jswhit
Collaborator

jswhit commented Sep 22, 2022

From the discussion at conda-forge/netcdf4-feedstock#141, it looks like at least some of the segfaults are related to using netcdf4-python within threads. netcdf-c is not thread-safe, and releasing the GIL on all netcdf-c calls (introduced in 1.6.1) has increased the probability of segfaults when threads are used.
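Since netcdf-c is not thread-safe, concurrent use has to be serialized by the caller. A minimal stdlib-only sketch of that pattern follows; the `read_variable` helper is hypothetical, and the commented-out line shows where a real `netCDF4.Dataset` call would go inside the lock:

```python
import threading

# One process-wide lock guarding all netCDF C-library calls: netcdf-c is
# not thread-safe, and netcdf4-python 1.6.1 releases the GIL around those
# calls, so unguarded threaded access can race inside the C library.
_NETCDF_LOCK = threading.Lock()

results = []

def read_variable(path, name):
    """Hypothetical reader; serializes all netCDF access behind one lock."""
    with _NETCDF_LOCK:
        # Real code would do something like:
        #   with netCDF4.Dataset(path) as ds:
        #       data = ds.variables[name][:]
        # Here we just record the call to show the reads are serialized.
        results.append((path, name))

threads = [
    threading.Thread(target=read_variable, args=("sample.nc", "var%d" % i))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert len(results) == 8  # every threaded read completed exactly once
```

This is the same idea as ocefpaf's dask workaround below: forcing a single-threaded scheduler simply guarantees no two netCDF calls ever overlap.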

@ocefpaf
Collaborator

ocefpaf commented Oct 4, 2022

Folks using iris and hitting this issue: you can work around it by setting dask to single-threaded with

```python
import dask
dask.config.set(scheduler="single-threaded")
```

instead of pinning to netcdf4!=1.6.1.

@jswhit
Collaborator

jswhit commented Oct 7, 2022

There is an experimental PR in netcdf-c that makes the C library threadsafe. This should fix many (all?) of the problems reported here, but won't be available in a released version for some time.

@trexfeathers

> There is an experimental PR in netcdf-c that makes the C library threadsafe. This should fix many (all?) of the problems reported here, but won't be available in a released version for some time.

@jswhit would you expect more releases of NetCDF4 before this feature is released?

@dopplershift
Member

@trexfeathers speaking on behalf of netcdf-C: yes. It is unclear when a threadsafe version of netcdf-c will be released.

@WeatherGod

Is there a plan to revert the GIL-freeing changes? I think I'm getting bitten by this on my linux systems.

tlvu added a commit to Ouranosinc/PAVICS-e2e-workflow-tests that referenced this issue May 24, 2023
See Unidata/netcdf4-python#1192 (comment)

Hopefully able to fix the following weird multiple warnings in homepage notebook 3:
```
HDF5-DIAG: Error detected in HDF5 (1.12.2) thread 1:
  #000: H5A.c line 528 in H5Aopen_by_name(): can't open attribute
    major: Attribute
    minor: Can't open object
  #1: H5VLcallback.c line 1091 in H5VL_attr_open(): attribute open failed
    major: Virtual Object Layer
    minor: Can't open object
  #2: H5VLcallback.c line 1058 in H5VL__attr_open(): attribute open failed
    major: Virtual Object Layer
    minor: Can't open object
  #3: H5VLnative_attr.c line 130 in H5VL__native_attr_open(): can't open attribute
    major: Attribute
    minor: Can't open object
  #4: H5Aint.c line 545 in H5A__open_by_name(): unable to load attribute info from object header
    major: Attribute
    minor: Unable to initialize object
  #5: H5Oattribute.c line 494 in H5O__attr_open_by_name(): can't locate attribute: '_QuantizeBitGroomNumberOfSignificantDigits'
    major: Attribute
    minor: Object not found
```