conflicting values for variable 'lat_bnds' #162

Closed

pochedls opened this issue Nov 19, 2021 · 5 comments · Fixed by #181
Labels
type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors.

Comments

@pochedls
Collaborator

pochedls commented Nov 19, 2021

What versions of software are you using?

  • Package Version: Main branch / Release 0.1.0

What are the steps to reproduce this issue?

import xcdat

p = '/p/css03/scratch/cmip6/CMIP/NCC/NorCPM1/historical/r10i1p1f1/Amon/tas/gn/v20190914/*nc'
ds = xcdat.open_mfdataset(p)

MergeError: conflicting values for variable 'lat_bnds' on objects to be combined. You can skip this check by specifying compat='override'.

What happens? Any logs, error output, etc?

I think this is happening because the lat_bnds values have slight (e.g., ~1e-13) differences across the different netCDF files.

Any other comments?

Opening with xarray works fine (as long as you do not set data_vars="minimal"), though lat_bnds is larger than expected:

import xarray as xr
ds = xr.open_mfdataset(p)
print(ds.lat_bnds.shape)

(2160, 142, 2)

It appears that the bounds differ depending on timestep:

np.array(ds.lat_bnds[0] - ds.lat_bnds[-1])

array([[-9.00000000e+01, -8.90526316e+01],
       [-8.90526316e+01, -8.71578947e+01],
       [-8.71578947e+01, -8.52631579e+01],
       ...,
       [            nan,             nan],
       [ 8.90526316e+01,  9.00000000e+01],
       [            nan,             nan]])

I'm not totally sure why there are NaN values. They don't appear when I use ncdump -v lat_bnds /p/css03/scratch/cmip6/CMIP/NCC/NorCPM1/historical/r10i1p1f1/Amon/tas/gn/v20190914/tas_Amon_NorCPM1_historical_r10i1p1f1_gn_185001-201412.nc, but they do appear after loading the dataset with xarray.

I am not sure how to handle this, though CDAT seems to be able to load the bounds without any problem.
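
For comparison, here is a rough sketch of how the same bounds could be read through CDAT/cdms2, assuming a cdscan-generated XML catalog (the catalog name below is illustrative, not an existing file):

# Illustrative only: build a multi-file catalog first, e.g.
#   cdscan -x tas_catalog.xml /p/css03/scratch/cmip6/CMIP/NCC/NorCPM1/historical/r10i1p1f1/Amon/tas/gn/v20190914/*.nc
import cdms2

f = cdms2.open("tas_catalog.xml")  # hypothetical catalog name
lat = f["tas"].getLatitude()
print(lat.getBounds().shape)  # per-axis bounds, expected to stay (96, 2) if CDAT reconciles the files
f.close()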

@pochedls pochedls added the type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors. label Nov 19, 2021
@tomvothecoder
Collaborator

tomvothecoder commented Nov 29, 2021

Opening with xarray works fine (as long as you do not set data_vars="minimal"), though lat_bnds is larger than expected:

import xarray as xr
ds = xr.open_mfdataset(p)
print(ds.lat_bnds.shape)
(2160, 142, 2)

The time dimension gets concatenated to non-time variables like lat_bnds (shown below) when data_vars="minimal" is not set. Related open issue with concatenation of dims: pydata/xarray#2064

>>> ds.lat_bnds.dims
('time', 'lat', 'bnds')

The possible solution is to use coords="minimal" and compat="override" in addition to data_vars="minimal".

  • data_vars: These data variables will be concatenated together:
    • “minimal”: Only data variables in which the dimension already appears are included.
  • coords: These coordinate variables will be concatenated together:
    • “minimal”: Only coordinates in which the dimension already appears are included.
  • compat: String indicating how to compare variables of the same name for potential conflicts when merging
    • “override”: skip comparing and pick variable from first dataset
# flake8: noqa F401
#%%
import xarray as xr

import xcdat

#%%
p = "/p/css03/scratch/cmip6/CMIP/NCC/NorCPM1/historical/r10i1p1f1/Amon/tas/gn/v20190914/*nc"

#%%
# xarray without data_vars="minimal"
# -------------------------------------
# This concats time dimension to non-time vars (non-desirable behavior).
ds_xr_no_min = xr.open_mfdataset(p)
ds_xr_no_min.lat_bnds.shape  # (2160, 142, 2)

#%%
# xcdat with only data_vars="minimal"
# -------------------------------------
# MergeError: conflicting values for variable 'lat_bnds' on objects to be
# combined. You can skip this check by specifying compat='override'.
ds_xcdat_no_min = xcdat.open_mfdataset(p)

#%%
# xarray with all correct settings
# -------------------------------------
# Does not concat time dimension, but an outer join is performed on incorrect
# mismatching values, resulting in an increase in the number of latitude
# coordinate points (non-desirable behavior).
ds_xr_settings = xr.open_mfdataset(
    p, data_vars="minimal", coords="minimal", compat="override"
)
ds_xr_settings.lat_bnds.shape  # (142, 2)

#%%
# xcdat with all correct settings
# -------------------------------------
# Does not concat time dimension, but an outer join is performed on incorrect
# mismatching values, resulting in an increase in the number of latitude
# coordinate points (non-desirable behavior).
ds_xcdat_settings = xcdat.open_mfdataset(p, coords="minimal", compat="override")
ds_xcdat_settings.lat_bnds.shape  # (142, 2)

It appears that the bounds differ depending on timestep:

np.array(ds.lat_bnds[0] - ds.lat_bnds[-1])

array([[-9.00000000e+01, -8.90526316e+01],
       [-8.90526316e+01, -8.71578947e+01],
       [-8.71578947e+01, -8.52631579e+01],
       ...,
       [            nan,             nan],
       [ 8.90526316e+01,  9.00000000e+01],
       [            nan,             nan]])

I'm not totally sure why there are NaN values. They don't appear when I use ncdump -v lat_bnds /p/css03/scratch/cmip6/CMIP/NCC/NorCPM1/historical/r10i1p1f1/Amon/tas/gn/v20190914/tas_Amon_NorCPM1_historical_r10i1p1f1_gn_185001-201412.nc, but they do appear after loading the dataset with xarray.

After resolving the concatenation of the time dimension and data variable compatibility issues, I still noticed NaN values as well. I am investigating why this is happening in xarray.

>>> ds_xr_with_min.lat_bnds[0].values
array([-90.        , -89.05263158])

>>> ds_xr_with_min.lat_bnds[-1].values
array([nan, nan])

@tomvothecoder
Collaborator

tomvothecoder commented Nov 29, 2021

I think nan values are being produced because there are very small differences in lat_bnds values as you mentioned (also shown in code example below) and an outer join (union of object indexes) is performed by default. These nan values subsequently increase the size of lat_bnds from (96, 2) to (142, 2) after calling open_mfdataset().

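A minimal sketch with synthetic values (not the NorCPM1 data) of why an outer join inflates the index and fills NaN when float coordinates differ slightly:

import numpy as np
import xarray as xr

lat_a = np.array([-90.0, -89.052632])
lat_b = np.array([-90.0, -89.052632 + 1e-13])  # tiny floating point drift

da_a = xr.DataArray([1.0, 2.0], coords={"lat": lat_a}, dims="lat")
da_b = xr.DataArray([1.0, 2.0], coords={"lat": lat_b}, dims="lat")

# join="outer" takes the union of the two float indexes, so nearly equal labels
# are treated as distinct coordinate points and NaN is filled where data is missing.
aligned_a, aligned_b = xr.align(da_a, da_b, join="outer")
print(aligned_a.lat.size)  # 3 instead of 2
print(aligned_a.values)    # [ 1.  2. nan] -- no data at the drifted label
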
Comparing shapes of lat_bnds between datasets and performing floating point comparison:

# flake8: noqa F401
#%%
import numpy as np
import xarray as xr

import xcdat

path = "/p/css03/scratch/cmip6/CMIP/NCC/NorCPM1/historical/r10i1p1f1/Amon/tas/gn/v20190914/"
p = f"{path}*nc"

#%%
# 1. Check xcdat with all correct settings
# -------------------------------------
# Does not concat time dimension, but an outer join is performed on incorrect
# mismatching values, resulting in an increase in the number of latitude
# coordinate points (non-desirable behavior).
ds_mf = xcdat.open_mfdataset(p, coords="minimal", compat="override")
ds_mf.lat_bnds.shape  # (142, 2)

# Check for nans
nan_indices = np.where(np.isnan(ds_mf.lat_bnds[:, :].values))[0]
nan_indices.size  # 92

#%%
# 2. Check latitude sizes of individual files
# ------------------------------------------
# Make sure that the sizes of the latitude bounds are aligned
ds1 = xr.open_dataset(
    f"{path}tas_Amon_NorCPM1_historical_r10i1p1f1_gn_185001-201412.nc"
)
ds2 = xr.open_dataset(
    f"{path}tas_Amon_NorCPM1_historical_r10i1p1f1_gn_201501-201812.nc"
)
ds3 = xr.open_dataset(
    f"{path}tas_Amon_NorCPM1_historical_r10i1p1f1_gn_201901-202912.nc"
)

ds1.lat_bnds.shape  # (96, 2)
ds2.lat_bnds.shape  # (96, 2)
ds3.lat_bnds.shape  # (96, 2)

#%%
# 3. Check for floating point differences between files
# --------------------------------------------------
np.testing.assert_allclose(ds1.lat_bnds, ds2.lat_bnds)
"""
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 2 / 192 (1.04%)
Max absolute difference: 2.84217094e-14
Max relative difference: 1.17190208e-15
 x: array([[-9.000000e+01, -8.905263e+01],
       [-8.905263e+01, -8.715789e+01],
       [-8.715789e+01, -8.526316e+01],...
 y: array([[-90.      , -89.052632],
       [-89.052632, -87.157895],
       [-87.157895, -85.263158],.
"""

np.testing.assert_allclose(ds2.lat_bnds, ds3.lat_bnds)  # passes (no AssertionError raised)

np.testing.assert_allclose(ds1.lat_bnds, ds3.lat_bnds)
"""
AssertionError:
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 2 / 192 (1.04%)
Max absolute difference: 2.84217094e-14
Max relative difference: 1.17190208e-15
 x: array([[-9.000000e+01, -8.905263e+01],
       [-8.905263e+01, -8.715789e+01],
       [-8.715789e+01, -8.526316e+01],...
 y: array([[-90.      , -89.052632],
       [-89.052632, -87.157895],
       [-87.157895, -85.263158],...
"""

Comparing join options since I noticed that the lat_bnds of the individual datasets has a shape of (96, 2), not (142, 2):

#%%
# 4. Use different joins to avoid concatenating additional coordinate points
# ------------------------------------------------------------------------
# a. Outer join (default)
# ~~~~~~~~~~~~~~~~~~~~~~~
# use the union of object indexes, produces nans if there are floating point
# diffs between values
ds_outer = xcdat.open_mfdataset(
    p, data_vars="minimal", coords="minimal", compat="override", join="outer"
)
ds_outer.lat_bnds.shape  # (142, 2)

nan_indices = np.where(np.isnan(ds_outer.lat_bnds[:, :].values))[0]
nan_indices.size  # 92

#%%
# b. Left join
# ~~~~~~~~~~~~~~~~
# use indexes from the first object with each dimension
ds_left = xcdat.open_mfdataset(
    p, data_vars="minimal", coords="minimal", compat="override", join="left"
)

ds_left.lat_bnds.shape  # (96, 2)

nan_indices = np.where(np.isnan(ds_left.lat_bnds[:, :].values))[0]
nan_indices.size  # 0

#%%
# c. Override join
# ~~~~~~~~~~~~~~~~~
# if indexes are of same size, rewrite indexes to be those of the first object
# with that dimension. Indexes for the same dimension must have the same size in
# all objects.
ds_override = xcdat.open_mfdataset(
    p, data_vars="minimal", coords="minimal", compat="override", join="override"
)
ds_override.lat_bnds.shape  # (96, 2)

nan_indices = np.where(np.isnan(ds_override.lat_bnds[:, :].values))[0]
nan_indices.size  # 0

# %%
ds_left.lat_bnds.identical(ds_override.lat_bnds)  # True

# Conclusion -- Use data_vars="minimal", coords="minimal", compat="override",
# and join="left" or "override" if datasets have conflicting bounds values

The possible solutions I found so far are:

  1. Use join="override" or join="left" in open_mfdataset()
    • There may be implications with this option (e.g., if the first dataset has missing values for lat_bnds and you use them for the single joined dataset); see the sketch after this list.
  2. Use join="outer" in open_mfdataset() as is, but with fillna()
    • The size of variables like lat_bnds will change since an outer join is performed. Not sure how CDAT handles it.
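
Regarding the caveat in option 1, a small sketch (reusing the glob p from above) of how one might verify the first file's lat_bnds before relying on join="override" or join="left":

import glob

import numpy as np
import xarray as xr

import xcdat

# join="override"/"left" propagates the first object's indexes, so sanity-check
# the first file's bounds before letting them stand in for every file.
first_file = sorted(glob.glob(p))[0]
with xr.open_dataset(first_file) as ds_first:
    assert not bool(np.isnan(ds_first["lat_bnds"]).any()), "first file has NaN lat_bnds"

ds = xcdat.open_mfdataset(p, coords="minimal", compat="override", join="override")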

@tomvothecoder
Collaborator

Based on my findings above, I think we should provide those two options in the docs/docstring for cases where Datasets have conflicting values.

Since xarray/xcdat provides keyword arguments to handle this edge case, we should probably avoid implementing code to try to handle it.

@pochedls
Collaborator Author

pochedls commented Jan 9, 2022

This seems like a reasonable solution. One slightly more complex way of handling this situation would be to try to detect the problem with a pre-processor function, which could throw a warning or exception. I wasn't sure if the pre-processor functions can communicate information between files (e.g., to compare bounds across files) or if they must act independently on each netCDF file (and thus cannot make such comparisons).

@tomvothecoder
Collaborator

Thanks, I'll add documentation to cover this situation.

The preprocessing function is performed on each file independently before they are merged together into a single Dataset, so there is no communication between files.
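
For reference, a minimal sketch of how the per-file preprocess hook behaves; the check_bounds function and its NaN check are purely illustrative, not a proposed fix:

import numpy as np
import xarray as xr

def check_bounds(ds):
    # preprocess is called once per file before combining, so this function only
    # ever sees a single file's Dataset and cannot compare bounds across files.
    if "lat_bnds" in ds and bool(np.isnan(ds["lat_bnds"]).any()):
        source = ds.encoding.get("source", "unknown file")  # set by xarray when opened from a path
        raise ValueError(f"lat_bnds contains NaN in {source}")
    return ds

ds = xr.open_mfdataset(p, data_vars="minimal", preprocess=check_bounds)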
