conflicting values for variable 'lat_bnds' #162

Closed

pochedls opened this issue Nov 19, 2021 · 5 comments · Fixed by #181
Labels
type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors.

Comments

@pochedls
Collaborator

pochedls commented Nov 19, 2021

What versions of software are you using?

  • Package Version: Main branch / Release 0.1.0

What are the steps to reproduce this issue?

import xcdat

p = '/p/css03/scratch/cmip6/CMIP/NCC/NorCPM1/historical/r10i1p1f1/Amon/tas/gn/v20190914/*nc'
ds = xcdat.open_mfdataset(p)

MergeError: conflicting values for variable 'lat_bnds' on objects to be combined. You can skip this check by specifying compat='override'.

What happens? Any logs, error output, etc?

I think this is happening because the lat_bnds values have slight (e.g., ~1e-13) differences across the different netCDF files.

Any other comments?

Opening with xarray works fine (as long as you do not set data_vars="minimal"), though lat_bnds is larger than expected:

import xarray as xr
ds = xr.open_mfdataset(p)
print(ds.lat_bnds.shape)

(2160, 142, 2)

It appears that the bounds differ depending on timestep:

np.array(ds.lat_bnds[0] - ds.lat_bnds[-1])

array([[-9.00000000e+01, -8.90526316e+01],
       [-8.90526316e+01, -8.71578947e+01],
       [-8.71578947e+01, -8.52631579e+01],
       ...,
       [            nan,             nan],
       [ 8.90526316e+01,  9.00000000e+01],
       [            nan,             nan]])

I'm not totally sure why there are NaN values. They don't appear when I use ncdump -v lat_bnds /p/css03/scratch/cmip6/CMIP/NCC/NorCPM1/historical/r10i1p1f1/Amon/tas/gn/v20190914/tas_Amon_NorCPM1_historical_r10i1p1f1_gn_185001-201412.nc, but they do appear after loading the dataset with xarray.

I am not sure how to handle this, though CDAT seems to be able to load the bounds without any problem.
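
For comparison, here is a rough sketch of how the same bounds could be read through CDAT/cdms2, assuming a cdscan-generated XML catalog (the catalog name below is illustrative, not an existing file):

# Illustrative only: build a multi-file catalog first, e.g.
#   cdscan -x tas_catalog.xml /p/css03/scratch/cmip6/CMIP/NCC/NorCPM1/historical/r10i1p1f1/Amon/tas/gn/v20190914/*.nc
import cdms2

f = cdms2.open("tas_catalog.xml")  # hypothetical catalog name
lat = f["tas"].getLatitude()
print(lat.getBounds().shape)  # per-axis bounds, expected to stay (96, 2) if CDAT reconciles the files
f.close()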

@pochedls pochedls added the type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors. label Nov 19, 2021
@tomvothecoder
Collaborator

tomvothecoder commented Nov 29, 2021

Opening with xarray works fine (as long as you do not set data_vars="minimal"), though lat_bnds is larger than expected:

import xarray as xr
ds = xr.open_mfdataset(p)
print(ds.lat_bnds.shape)
(2160, 142, 2)

The time dimension gets concatenated to non-time variables like lat_bnds (shown below) when data_vars="minimal" is not set. Related open issue with concatenation of dims: pydata/xarray#2064

>>> ds.lat_bnds.dims
('time', 'lat', 'bnds')

The possible solution is to use coords="minimal" and compat="override" in addition to data_vars="minimal".

  • data_vars: These data variables will be concatenated together:
    • “minimal”: Only data variables in which the dimension already appears are included.
  • coords: These coordinate variables will be concatenated together:
    • “minimal”: Only coordinates in which the dimension already appears are included.
  • compat: String indicating how to compare variables of the same name for potential conflicts when merging
    • “override”: skip comparing and pick variable from first dataset
# flake8: noqa F401
#%%
import xarray as xr

import xcdat

#%%
p = "/p/css03/scratch/cmip6/CMIP/NCC/NorCPM1/historical/r10i1p1f1/Amon/tas/gn/v20190914/*nc"

#%%
# xarray without data_vars="minimal"
# -------------------------------------
# This concats time dimension to non-time vars (non-desirable behavior).
ds_xr_no_min = xr.open_mfdataset(p)
ds_xr_no_min.lat_bnds.shape  # (2160, 142, 2)

#%%
# xcdat with only data_vars="minimal"
# -------------------------------------
# MergeError: conflicting values for variable 'lat_bnds' on objects to be
# combined. You can skip this check by specifying compat='override'.
ds_xcdat_no_min = xcdat.open_mfdataset(p)

#%%
# xarray with all correct settings
# -------------------------------------
# Does not concat time dimension, but an outer join is performed on incorrect
# mismatching values, resulting in an increase in the number of latitude
# coordinate points (non-desirable behavior).
ds_xr_settings = xr.open_mfdataset(
    p, data_vars="minimal", coords="minimal", compat="override"
)
ds_xr_settings.lat_bnds.shape  # (142, 2)

#%%
# xcdat with all correct settings
# -------------------------------------
# Does not concat time dimension, but an outer join is performed on incorrect
# mismatching values, resulting in an increase in the number of latitude
# coordinate points (non-desirable behavior).
ds_xcdat_settings = xcdat.open_mfdataset(p, coords="minimal", compat="override")
ds_xcdat_settings.lat_bnds.shape  # (142, 2)

It appears that the bounds differ depending on timestep:

np.array(ds.lat_bnds[0] - ds.lat_bnds[-1])

array([[-9.00000000e+01, -8.90526316e+01],
       [-8.90526316e+01, -8.71578947e+01],
       [-8.71578947e+01, -8.52631579e+01],
       ...,
       [            nan,             nan],
       [ 8.90526316e+01,  9.00000000e+01],
       [            nan,             nan]])

I'm not totally sure why there are NaN values. They don't appear when I use ncdump -v lat_bnds /p/css03/scratch/cmip6/CMIP/NCC/NorCPM1/historical/r10i1p1f1/Amon/tas/gn/v20190914/tas_Amon_NorCPM1_historical_r10i1p1f1_gn_185001-201412.nc, but they do appear after loading the dataset with xarray.

After resolving the concatenation of the time dimension and data variable compatibility issues, I still noticed NaN values as well. I am investigating why this is happening in xarray.

>>> ds_xr_with_min.lat_bnds[0].values
array([-90.        , -89.05263158])

>>> ds_xr_with_min.lat_bnds[-1].values
array([nan, nan])

@tomvothecoder
Collaborator

tomvothecoder commented Nov 29, 2021

I think nan values are being produced because there are very small differences in lat_bnds values as you mentioned (also shown in code example below) and an outer join (union of object indexes) is performed by default. These nan values subsequently increase the size of lat_bnds from (96, 2) to (142, 2) after calling open_mfdataset().

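A minimal sketch with synthetic values (not the NorCPM1 data) of why an outer join inflates the index and fills NaN when float coordinates differ slightly:

import numpy as np
import xarray as xr

lat_a = np.array([-90.0, -89.052632])
lat_b = np.array([-90.0, -89.052632 + 1e-13])  # tiny floating point drift

da_a = xr.DataArray([1.0, 2.0], coords={"lat": lat_a}, dims="lat")
da_b = xr.DataArray([1.0, 2.0], coords={"lat": lat_b}, dims="lat")

# join="outer" takes the union of the two float indexes, so nearly equal labels
# are treated as distinct coordinate points and NaN is filled where data is missing.
aligned_a, aligned_b = xr.align(da_a, da_b, join="outer")
print(aligned_a.lat.size)  # 3 instead of 2
print(aligned_a.values)    # [ 1.  2. nan] -- no data at the drifted label
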
Comparing shapes of lat_bnds between datasets and performing floating point comparison:

# flake8: noqa F401
#%%
import numpy as np
import xarray as xr

import xcdat

path = "/p/css03/scratch/cmip6/CMIP/NCC/NorCPM1/historical/r10i1p1f1/Amon/tas/gn/v20190914/"
p = f"{path}*nc"

#%%
# 1. Check xcdat with all correct settings
# -------------------------------------
# Does not concat time dimension, but an outer join is performed on incorrect
# mismatching values, resulting in an increase in the number of latitude
# coordinate points (non-desirable behavior).
ds_mf = xcdat.open_mfdataset(p, coords="minimal", compat="override")
ds_mf.lat_bnds.shape  # (142, 2)

# Check for nans
nan_indices = np.where(np.isnan(ds_mf.lat_bnds[:, :].values))[0]
nan_indices.size  # 92

#%%
# 2. Check latitude sizes of individual files
# ------------------------------------------
# Make sure that the sizes of the latitude bounds are aligned
ds1 = xr.open_dataset(
    f"{path}tas_Amon_NorCPM1_historical_r10i1p1f1_gn_185001-201412.nc"
)
ds2 = xr.open_dataset(
    f"{path}tas_Amon_NorCPM1_historical_r10i1p1f1_gn_201501-201812.nc"
)
ds3 = xr.open_dataset(
    f"{path}tas_Amon_NorCPM1_historical_r10i1p1f1_gn_201901-202912.nc"
)

ds1.lat_bnds.shape  # (96, 2)
ds2.lat_bnds.shape  # (96, 2)
ds3.lat_bnds.shape  # (96, 2)

#%%
# 3. Check for floating point differences between files
# --------------------------------------------------
np.testing.assert_allclose(ds1.lat_bnds, ds2.lat_bnds)
"""
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 2 / 192 (1.04%)
Max absolute difference: 2.84217094e-14
Max relative difference: 1.17190208e-15
 x: array([[-9.000000e+01, -8.905263e+01],
       [-8.905263e+01, -8.715789e+01],
       [-8.715789e+01, -8.526316e+01],...
 y: array([[-90.      , -89.052632],
       [-89.052632, -87.157895],
       [-87.157895, -85.263158],.
"""

np.testing.assert_allclose(ds2.lat_bnds, ds3.lat_bnds)  # passes (no AssertionError raised)

np.testing.assert_allclose(ds1.lat_bnds, ds3.lat_bnds)
"""
AssertionError:
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 2 / 192 (1.04%)
Max absolute difference: 2.84217094e-14
Max relative difference: 1.17190208e-15
 x: array([[-9.000000e+01, -8.905263e+01],
       [-8.905263e+01, -8.715789e+01],
       [-8.715789e+01, -8.526316e+01],...
 y: array([[-90.      , -89.052632],
       [-89.052632, -87.157895],
       [-87.157895, -85.263158],...
"""

Comparing join options since I noticed that the lat_bnds of the individual datasets has a shape of (96, 2), not (142, 2):

#%%
# 4. Use different joins to avoid concatenating additional coordinate points
# ------------------------------------------------------------------------
# a. Outer join (default)
# ~~~~~~~~~~~~~~~~~~~~~~~
# use the union of object indexes, produces nans if there are floating point
# diffs between values
ds_outer = xcdat.open_mfdataset(
    p, data_vars="minimal", coords="minimal", compat="override", join="outer"
)
ds_outer.lat_bnds.shape  # (142, 2)

nan_indices = np.where(np.isnan(ds_outer.lat_bnds[:, :].values))[0]
nan_indices.size  # 92

#%%
# b. Left join
# ~~~~~~~~~~~~~~~~
# use indexes from the first object with each dimension
ds_left = xcdat.open_mfdataset(
    p, data_vars="minimal", coords="minimal", compat="override", join="left"
)

ds_left.lat_bnds.shape  # (96, 2)

nan_indices = np.where(np.isnan(ds_left.lat_bnds[:, :].values))[0]
nan_indices.size  # 0

#%%
# c. Override join
# ~~~~~~~~~~~~~~~~~
# if indexes are of same size, rewrite indexes to be those of the first object
# with that dimension. Indexes for the same dimension must have the same size in
# all objects.
ds_override = xcdat.open_mfdataset(
    p, data_vars="minimal", coords="minimal", compat="override", join="override"
)
ds_override.lat_bnds.shape  # (96, 2)

nan_indices = np.where(np.isnan(ds_override.lat_bnds[:, :].values))[0]
nan_indices.size  # 0

# %%
ds_left.lat_bnds.identical(ds_override.lat_bnds)  # True

# Conclusion -- Use data_vars="minimal", coords="minimal", compat="override",
# and join="left" or "override" if datasets have conflicting bounds values

The possible solutions I found so far are:

  1. Use join="override" or join="left" in open_mfdataset()
    • There may be implications with this option (e.g., if the first dataset has missing values for lat_bnds and you use them for the single joined dataset); see the sketch after this list.
  2. Use join="outer" in open_mfdataset() as is, but with fillna()
    • The size of variables like lat_bnds will change since an outer join is performed. Not sure how CDAT handles it.
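
Regarding the caveat in option 1, a small sketch (reusing the glob p from above) of how one might verify the first file's lat_bnds before relying on join="override" or join="left":

import glob

import numpy as np
import xarray as xr

import xcdat

# join="override"/"left" propagates the first object's indexes, so sanity-check
# the first file's bounds before letting them stand in for every file.
first_file = sorted(glob.glob(p))[0]
with xr.open_dataset(first_file) as ds_first:
    assert not bool(np.isnan(ds_first["lat_bnds"]).any()), "first file has NaN lat_bnds"

ds = xcdat.open_mfdataset(p, coords="minimal", compat="override", join="override")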

@tomvothecoder
Collaborator

Based on my findings above, I think we should provide those two options in the docs/docstring for cases where Datasets have conflicting values.

Since xarray/xcdat provides keyword arguments to handle this edge case, we should probably avoid implementing code to try to handle it.

@pochedls
Collaborator Author

pochedls commented Jan 9, 2022

This seems like a reasonable solution. One slightly more complex way of handling this situation would be to try to detect the problem with a pre-processor function, which could throw a warning or exception. I wasn't sure if the pre-processor functions can communicate information between files (e.g., to compare bounds across files) or if they must act independently on each netCDF file (and thus cannot make such comparisons).

@tomvothecoder
Collaborator

Thanks, I'll add documentation to cover this situation.

The preprocessing function is performed on each file independently before they are merged together into a single Dataset, so there is no communication between files.
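
For reference, a minimal sketch of how the per-file preprocess hook behaves; the check_bounds function and its NaN check are purely illustrative, not a proposed fix:

import numpy as np
import xarray as xr

def check_bounds(ds):
    # preprocess is called once per file before combining, so this function only
    # ever sees a single file's Dataset and cannot compare bounds across files.
    if "lat_bnds" in ds and bool(np.isnan(ds["lat_bnds"]).any()):
        source = ds.encoding.get("source", "unknown file")  # set by xarray when opened from a path
        raise ValueError(f"lat_bnds contains NaN in {source}")
    return ds

ds = xr.open_mfdataset(p, data_vars="minimal", preprocess=check_bounds)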
