Allow fix_file to return Cube and CubeList objects #2579

Open
wants to merge 36 commits into
base: main
68845bb
Avoid copying input data in CMIP6 CESM2 fixes
bouweandela Jun 10, 2024
2541ab1
run a GA test
valeriupredoi Jun 17, 2024
c038510
Use ruff formatting
bouweandela Sep 26, 2024
3e95d3e
Merge branch 'main' of github.com:ESMValGroup/ESMValCore into in-memo…
bouweandela Sep 26, 2024
2e388ed
Disable github action on branch
bouweandela Sep 26, 2024
55bde64
Merge branch 'main' of github.com:ESMValGroup/ESMValCore into in-memo…
bouweandela Oct 17, 2024
e890f56
Fewer changes
bouweandela Oct 17, 2024
142ba29
Fewer changes
bouweandela Oct 17, 2024
625aa75
Move different attribute removal to concatenate
bouweandela Oct 17, 2024
588130a
Add test
bouweandela Oct 17, 2024
aadffe8
Various improvements
bouweandela Oct 23, 2024
07a5e37
Add docstring
bouweandela Oct 23, 2024
b95ee63
Merge branch 'main' into in-memory-fix-file
bouweandela Oct 23, 2024
379fd45
Merge remote-tracking branch 'origin/in-memory-fix-file' into better_…
schlunma Nov 11, 2024
e593611
Do not change CESM2 fixes
schlunma Nov 11, 2024
645d555
More flexible fix_file and ignore warnings
schlunma Nov 11, 2024
891e82b
Add function to convert between xarray/ncdata and iris
schlunma Nov 11, 2024
77c9ef1
Add first tests
schlunma Nov 11, 2024
94ccc90
Add more tests
schlunma Nov 12, 2024
a24de25
Add test to check that no warning is raised
schlunma Nov 12, 2024
3d1bd93
Use ignore_warnings in fix_file
schlunma Nov 12, 2024
e127a9d
Add doc
schlunma Nov 12, 2024
31d3762
Better doc
schlunma Nov 12, 2024
9dbbb40
Merge remote-tracking branch 'origin/main' into better_fix_file
schlunma Nov 12, 2024
7696480
100% coverage in fix.py
schlunma Nov 12, 2024
11c1b51
100% coverage
schlunma Nov 12, 2024
7e097bb
Avoid circular import
schlunma Nov 12, 2024
b8eb5b8
Doc
schlunma Nov 12, 2024
543efe1
Better comment
schlunma Nov 12, 2024
6808164
100% coverage
schlunma Nov 12, 2024
ee043a1
Add missing parameter to docstring
schlunma Nov 12, 2024
b9f70f7
Better docstrings
schlunma Nov 12, 2024
91bbe8d
Improve doc rendering
schlunma Nov 12, 2024
903d694
Double backticks for monospace font
schlunma Nov 12, 2024
b47ba81
Better docstring
schlunma Nov 14, 2024
863b0f1
Merge remote-tracking branch 'origin/main' into better_fix_file
schlunma Dec 11, 2024
2 changes: 2 additions & 0 deletions doc/conf.py
@@ -456,10 +456,12 @@
     'iris': ('https://scitools-iris.readthedocs.io/en/stable/', None),
     'esmf_regrid': ('https://iris-esmf-regrid.readthedocs.io/en/stable/', None),
     'matplotlib': ('https://matplotlib.org/stable/', None),
+    'ncdata': ('https://ncdata.readthedocs.io/en/stable/', None),
     'numpy': ('https://numpy.org/doc/stable/', None),
     'pyesgf': ('https://esgf-pyclient.readthedocs.io/en/stable/', None),
     'python': ('https://docs.python.org/3/', None),
     'scipy': ('https://docs.scipy.org/doc/scipy/', None),
+    'xarray': ('https://docs.xarray.dev/en/stable/', None),
 }

 # -- Extlinks extension -------------------------------------------------------
6 changes: 3 additions & 3 deletions doc/develop/fixing_data.rst
@@ -126,9 +126,9 @@ Then we have to create the class for the fix deriving from
 Next we must choose the method to use between the ones offered by the
 Fix class:

-- ``fix_file``: should be used only to fix errors that prevent data loading.
-  As a rule of thumb, you should only use it if the execution halts before
-  reaching the checks.
+- ``fix_file``: you need to fix errors that prevent loading the data with Iris
+  or perform operations that are more efficient with other packages (e.g.,
+  loading files with lots of variables is much faster with Xarray than Iris).

 - ``fix_metadata``: you want to change something in the cube that is not
   the data (e.g., variable or coordinate names, data units).
1 change: 1 addition & 0 deletions doc/quickstart/configure.rst
@@ -844,6 +844,7 @@ the preprocessing chain.

 Currently supported preprocessor steps:

+* :func:`~esmvalcore.preprocessor.fix_file`
 * :func:`~esmvalcore.preprocessor.load`

 Here is an example on how to ignore specific warnings during the preprocessor
30 changes: 16 additions & 14 deletions doc/recipe/preprocessor.rst
@@ -272,20 +272,22 @@ ESMValCore deals with those issues by applying specific fixes for those
 datasets that require them. Fixes are applied at three different preprocessor
 steps:

-- ``fix_file``: apply fixes directly to a copy of the file.
-  Copying the files is costly, so only errors that prevent Iris to load the
-  file are fixed here.
-  See :func:`esmvalcore.preprocessor.fix_file`.
-
-- ``fix_metadata``: metadata fixes are done just before concatenating the
-  cubes loaded from different files in the final one.
-  Automatic metadata fixes are also applied at this step.
-  See :func:`esmvalcore.preprocessor.fix_metadata`.
-
-- ``fix_data``: data fixes are applied before starting any operation that
-  will alter the data itself.
-  Automatic data fixes are also applied at this step.
-  See :func:`esmvalcore.preprocessor.fix_data`.
+- ``fix_file``: apply fixes to data before loading them with Iris.
+  This is mainly intended to fix errors that prevent data loading with Iris
+  (e.g., those related to ``missing_value`` or ``_FillValue``) or
+  operations that are more efficient with other packages (e.g., loading
+  files with lots of variables is much faster with Xarray than Iris). See
+  :func:`esmvalcore.preprocessor.fix_file`.
+
+- ``fix_metadata``: metadata fixes are done just before concatenating the
+  cubes loaded from different files in the final one.
+  Automatic metadata fixes are also applied at this step.
+  See :func:`esmvalcore.preprocessor.fix_metadata`.
+
+- ``fix_data``: data fixes are applied before starting any operation that
+  will alter the data itself.
+  Automatic data fixes are also applied at this step.
+  See :func:`esmvalcore.preprocessor.fix_data`.

 To get an overview on data fixes and how to implement new ones, please go to
 :ref:`fixing_data`.
1 change: 1 addition & 0 deletions environment.yml
@@ -25,6 +25,7 @@ dependencies:
   - jinja2
   - libnetcdf !=4.9.1 # to avoid hdf5 warnings
   - nc-time-axis
+  - ncdata
   - nested-lookup
   - netcdf4
   - numpy !=1.24.3
1 change: 1 addition & 0 deletions esmvalcore/cmor/_fixes/cmip6/cesm2.py
@@ -25,6 +25,7 @@ def _fix_formula_terms(
         filepath,
         output_dir,
         add_unique_suffix=False,
+        ignore_warnings=None,
     ):
         """Fix ``formula_terms`` attribute."""
         new_path = self.get_fixed_filepath(
8 changes: 7 additions & 1 deletion esmvalcore/cmor/_fixes/cmip6/cesm2_waccm.py
@@ -15,7 +15,13 @@
 class Cl(BaseCl):
     """Fixes for cl."""

-    def fix_file(self, filepath, output_dir, add_unique_suffix=False):
+    def fix_file(
+        self,
+        filepath,
+        output_dir,
+        add_unique_suffix=False,
+        ignore_warnings=None,
+    ):
         """Fix hybrid pressure coordinate.

         Adds missing ``formula_terms`` attribute to file.
8 changes: 7 additions & 1 deletion esmvalcore/cmor/_fixes/emac/emac.py
@@ -37,7 +37,13 @@ class AllVars(EmacFix):
         "kg/m**2s": "kg m-2 s-1",
     }

-    def fix_file(self, filepath, output_dir, add_unique_suffix=False):
+    def fix_file(
+        self,
+        filepath,
+        output_dir,
+        add_unique_suffix=False,
+        ignore_warnings=None,
+    ):
         """Fix file.

         Fixes hybrid pressure level coordinate.
129 changes: 117 additions & 12 deletions esmvalcore/cmor/_fixes/fix.py
@@ -11,7 +11,12 @@
 from typing import TYPE_CHECKING, Any, Optional

 import dask
+import iris
+import ncdata.iris
+import ncdata.iris_xarray
+import ncdata.threadlock_sharing
 import numpy as np
+import xarray as xr
 from cf_units import Unit
 from iris.coords import Coord, CoordExtent
 from iris.cube import Cube, CubeList
@@ -27,7 +32,11 @@
 )
 from esmvalcore.cmor.fixes import get_time_bounds
 from esmvalcore.cmor.table import get_var_info
-from esmvalcore.iris_helpers import has_unstructured_grid, safe_convert_units
+from esmvalcore.iris_helpers import (
+    has_unstructured_grid,
+    ignore_warnings_context,
+    safe_convert_units,
+)

 if TYPE_CHECKING:
     from esmvalcore.cmor.table import CoordinateInfo, VariableInfo
@@ -36,6 +45,9 @@
 logger = logging.getLogger(__name__)
 generic_fix_logger = logging.getLogger(f"{__name__}.genericfix")

+# Enable lock sharing between ncdata and iris/xarray
+ncdata.threadlock_sharing.enable_lockshare(iris=True, xarray=True)
+

 class Fix:
     """Base class for dataset fixes."""
@@ -78,28 +90,43 @@ def fix_file(
         filepath: Path,
         output_dir: Path,
         add_unique_suffix: bool = False,
-    ) -> str | Path:
-        """Apply fixes to the files prior to creating the cube.
+        ignore_warnings: Optional[list[dict]] = None,
+    ) -> str | Path | Cube | CubeList:
+        """Fix files before loading them into a :class:`~iris.cube.CubeList`.

-        Should be used only to fix errors that prevent loading or cannot be
-        fixed in the cube (e.g., those related to `missing_value` or
-        `_FillValue`).
+        This is mainly intended to fix errors that prevent loading the data
+        with Iris (e.g., those related to ``missing_value`` or ``_FillValue``)
+        or operations that are more efficient with other packages (e.g.,
+        loading files with lots of variables is much faster with Xarray than
+        Iris).
+
+        Warning
+        -------
+        A path should only be returned if it points to the original (unchanged)
+        file (i.e., a fix was not necessary). If a fix is necessary, this
+        function should return a :class:`~iris.cube.Cube` or
+        :class:`~iris.cube.CubeList`, which can for example be created from an
+        :class:`~ncdata.NcData` or :class:`~xarray.Dataset` object using the
+        helper function ``Fix.dataset_to_iris()``. Under no circumstances a
+        copy of the input data should be created (this is very inefficient).

         Parameters
         ----------
         filepath:
-            File to fix.
+            Path to the original file. Original files should not be overwritten.
         output_dir:
             Output directory for fixed files.
         add_unique_suffix:
-            Adds a unique suffix to `output_dir` for thread safety.
+            Adds a unique suffix to ``output_dir`` for thread safety.
+        ignore_warnings:
+            Keyword arguments passed to :func:`warnings.filterwarnings` used to
+            ignore warnings during data loading. Each list element corresponds
+            to one call to :func:`warnings.filterwarnings`.

         Returns
         -------
-        str or pathlib.Path
-            Path to the corrected file. It can be different from the original
-            filepath if a fix has been applied, but if not it should be the
-            original filepath.
+        str | Path | Cube | CubeList:
+            Fixed cube(s) or a path to them.

         """
         return filepath
@@ -157,6 +184,84 @@ def get_cube_from_list(
                 return cube
         raise ValueError(f'Cube for variable "{short_name}" not found')

+    @staticmethod
+    def _get_attribute(
+        data: ncdata.NcData | ncdata.NcVariable | xr.Dataset | xr.DataArray,
+        attribute_name: str,
+    ) -> Any:
+        """Get attribute from an ncdata or xarray object."""
+        if hasattr(data, "attributes"):  # ncdata.NcData | ncdata.NcVariable
+            attribute = data.attributes[attribute_name].value
+        else:  # xr.Dataset | xr.DataArray
+            attribute = data.attrs[attribute_name]
+        return attribute
+
+    def dataset_to_iris(
+        self,
+        dataset: ncdata.NcData | xr.Dataset,
+        filepath: str | Path,
+        ignore_warnings: Optional[list[dict]] = None,
+    ) -> CubeList:
+        """Convert dataset to :class:`~iris.cube.CubeList`.
+
+        This function mimics the behavior of
+        :func:`esmvalcore.preprocessor.load`.
+
+        Parameters
+        ----------
+        dataset:
+            The dataset object to convert.
+        filepath:
+            The path that the dataset was loaded from.
+        ignore_warnings:
+            Keyword arguments passed to :func:`warnings.filterwarnings` used to
+            ignore warnings during data loading. Each list element corresponds
+            to one call to :func:`warnings.filterwarnings`.
+
+        Returns
+        -------
+        CubeList
+            :class:`~iris.cube.CubeList` containing the requested cubes.
+
+        Raises
+        ------
+        TypeError
+            Invalid type for ``dataset`` given.
+
+        """
+        if isinstance(dataset, ncdata.NcData):
+            conversion_func = ncdata.iris.to_iris
+            ds_coords = dataset.variables
+        elif isinstance(dataset, xr.Dataset):
+            conversion_func = ncdata.iris_xarray.cubes_from_xarray
+            ds_coords = dataset.coords
+        else:
+            raise TypeError(
+                f"Expected type ncdata.NcData or xr.Dataset for dataset, got "
+                f"type {type(dataset)}"
+            )
+
+        with ignore_warnings_context(ignore_warnings):
+            cubes = conversion_func(dataset)
+
+        # Restore the lat/lon coordinate units that iris changes to degrees
+        for coord_name in ["latitude", "longitude"]:
+            for cube in cubes:
+                try:
+                    coord = cube.coord(coord_name)
+                except iris.exceptions.CoordinateNotFoundError:
+                    pass
+                else:
+                    if coord.var_name in ds_coords:
+                        ds_coord = ds_coords[coord.var_name]
+                        coord.units = self._get_attribute(ds_coord, "units")
+
+                # Add the source file as an attribute to support grouping by
+                # file when calling fix_metadata.
+                cube.attributes["source_file"] = str(filepath)
+
+        return cubes
+
     def fix_data(self, cube: Cube) -> Cube:
         """Apply fixes to the data of the cube.

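Reviewer note: the `hasattr` dispatch in `_get_attribute` can be tried out without ncdata or xarray installed. The sketch below uses hypothetical stand-in classes (`FakeAttr`, `FakeNcVariable`, `FakeDataArray`) that imitate only the two access patterns the diff relies on, namely `.attributes[name].value` for ncdata-like objects and `.attrs[name]` for xarray-like objects. The stand-ins and the free function `get_attribute` are assumptions for illustration, not the real library types or the method added in this PR.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeAttr:
    """Hypothetical stand-in for an attribute wrapper exposing ``.value``."""

    value: Any


@dataclass
class FakeNcVariable:
    """Hypothetical stand-in: attributes live in ``.attributes``."""

    attributes: dict = field(default_factory=dict)


@dataclass
class FakeDataArray:
    """Hypothetical stand-in: attributes live in ``.attrs``."""

    attrs: dict = field(default_factory=dict)


def get_attribute(data: Any, attribute_name: str) -> Any:
    """Return an attribute value from either kind of object."""
    if hasattr(data, "attributes"):  # ncdata-like object
        return data.attributes[attribute_name].value
    return data.attrs[attribute_name]  # xarray-like object


nc_var = FakeNcVariable(attributes={"units": FakeAttr("degrees_north")})
xr_var = FakeDataArray(attrs={"units": "degrees_north"})

nc_units = get_attribute(nc_var, "units")
xr_units = get_attribute(xr_var, "units")
```

In this sketch a missing attribute raises `KeyError` in both branches, because both stand-ins store attributes in plain dictionaries.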
8 changes: 7 additions & 1 deletion esmvalcore/cmor/_fixes/ipslcm/ipsl_cm6.py
@@ -17,7 +17,13 @@
 class AllVars(Fix):
     """Fixes for all IPSLCM variables."""

-    def fix_file(self, filepath, output_dir, add_unique_suffix=False):
+    def fix_file(
+        self,
+        filepath,
+        output_dir,
+        add_unique_suffix=False,
+        ignore_warnings=None,
+    ):
         """Select IPSLCM variable in filepath.

         This is done only if input file is a multi-variable one. This
38 changes: 28 additions & 10 deletions esmvalcore/cmor/fix.py
@@ -33,19 +33,30 @@ def fix_file(
     add_unique_suffix: bool = False,
     session: Optional[Session] = None,
     frequency: Optional[str] = None,
+    ignore_warnings: Optional[list[dict]] = None,
     **extra_facets,
-) -> str | Path:
-    """Fix files before ESMValTool can load them.
+) -> str | Path | Cube | CubeList:
+    """Fix files before loading them into a :class:`~iris.cube.CubeList`.

-    These fixes are only for issues that prevent iris from loading the cube or
-    that cannot be fixed after the cube is loaded.
+    This is mainly intended to fix errors that prevent loading the data with
+    Iris (e.g., those related to ``missing_value`` or ``_FillValue``) or
+    operations that are more efficient with other packages (e.g., loading files
+    with lots of variables is much faster with Xarray than Iris).

-    Original files are not overwritten.
+    Warning
+    -------
+    A path should only be returned if it points to the original (unchanged)
+    file (i.e., a fix was not necessary). If a fix is necessary, this function
+    should return a :class:`~iris.cube.Cube` or :class:`~iris.cube.CubeList`,
+    which can for example be created from an :class:`~ncdata.NcData` or
+    :class:`~xarray.Dataset` object using the helper function
+    ``Fix.dataset_to_iris()``. Under no circumstances a copy of the input data
+    should be created (this is very inefficient).

     Parameters
     ----------
     file:
-        Path to the original file.
+        Path to the original file. Original files are not overwritten.
     short_name:
         Variable's short name.
     project:
@@ -57,19 +68,23 @@ def fix_file(
     output_dir:
         Output directory for fixed files.
     add_unique_suffix:
-        Adds a unique suffix to `output_dir` for thread safety.
+        Adds a unique suffix to ``output_dir`` for thread safety.
     session:
         Current session which includes configuration and directory information.
     frequency:
         Variable's data frequency, if available.
+    ignore_warnings:
+        Keyword arguments passed to :func:`warnings.filterwarnings` used to
+        ignore warnings during data loading. Each list element corresponds to
+        one call to :func:`warnings.filterwarnings`.
     **extra_facets:
         Extra facets are mainly used for data outside of the big projects like
         CMIP, CORDEX, obs4MIPs. For details, see :ref:`extra_facets`.

     Returns
     -------
-    str or pathlib.Path
-        Path to the fixed file.
+    str | Path | Cube | CubeList:
+        Fixed cube(s) or a path to them.

     """
     # Update extra_facets with variable information given as regular arguments
@@ -94,7 +109,10 @@ def fix_file(
         frequency=frequency,
     ):
         file = fix.fix_file(
-            file, output_dir, add_unique_suffix=add_unique_suffix
+            file,
+            output_dir,
+            add_unique_suffix=add_unique_suffix,
+            ignore_warnings=ignore_warnings,
         )
     return file

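Reviewer note: because the loop above assigns `file = fix.fix_file(file, ...)`, each fix receives the return value of the previous one. With the widened return type, a later fix may therefore be handed in-memory data rather than a path once an earlier fix has converted the file. A minimal sketch of that chaining follows; `FakeCube`, `NoOpFix`, `ConvertingFix`, `apply_fixes` and the file name are hypothetical stand-ins for illustration, not the real ESMValCore or Iris classes.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Union


@dataclass
class FakeCube:
    """Hypothetical stand-in for an in-memory cube."""

    source_file: str


class NoOpFix:
    """No fix needed: hand back whatever was received, unchanged."""

    def fix_file(self, file, output_dir, add_unique_suffix=False,
                 ignore_warnings=None):
        return file


class ConvertingFix:
    """Fix needed: return in-memory data instead of a copied file."""

    def fix_file(self, file, output_dir, add_unique_suffix=False,
                 ignore_warnings=None):
        if isinstance(file, FakeCube):  # an earlier fix already converted
            return file
        return FakeCube(source_file=str(file))


def apply_fixes(
    file: Path, fixes: list, output_dir: Path
) -> Union[Path, FakeCube]:
    """Thread the result of each fix into the next one."""
    result: Union[Path, FakeCube] = file
    for fix in fixes:
        result = fix.fix_file(result, output_dir, add_unique_suffix=True)
    return result


original = Path("tas_example.nc")  # hypothetical name; the file is never opened
unchanged = apply_fixes(original, [NoOpFix()], Path("fixed"))
converted = apply_fixes(
    original, [NoOpFix(), ConvertingFix(), NoOpFix()], Path("fixed")
)
```

The sketch suggests that a `fix_file` implementation which assumes it always gets a path could break when it runs after a converting fix, so checking the input type first, as `ConvertingFix` does, is one defensive option.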
3 changes: 3 additions & 0 deletions esmvalcore/dataset.py
@@ -753,6 +753,9 @@ def _load(self) -> Cube:
             "output_dir": fix_dir_prefix,
             "add_unique_suffix": True,
             "session": self.session,
+            "ignore_warnings": get_ignored_warnings(
+                self.facets["project"], "fix_file"
+            ),
             **self.facets,
         }
         settings["load"] = {
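Reviewer note: the docstrings in this PR describe `ignore_warnings` as a list of dictionaries, each holding the keyword arguments for one call to `warnings.filterwarnings`. The sketch below shows one way such a list could be applied using only the standard library. `ignore_listed_warnings` and `noisy_load` are hypothetical helpers written for this illustration; this is not the `ignore_warnings_context` helper imported in the diff, whose implementation is not shown on this page. Defaulting `action` to `"ignore"` is an assumption of the sketch.

```python
import warnings
from contextlib import contextmanager
from typing import Iterator, Optional


@contextmanager
def ignore_listed_warnings(
    ignore_warnings: Optional[list] = None,
) -> Iterator[None]:
    """Temporarily ignore warnings matching the given filter settings."""
    with warnings.catch_warnings():  # filters are restored on exit
        for kwargs in ignore_warnings or []:
            # Assumption: default to action="ignore" unless the entry
            # sets its own action; the input dict is not modified.
            settings = {"action": "ignore", **kwargs}
            warnings.filterwarnings(**settings)
        yield


def noisy_load() -> str:
    """Pretend to load data while emitting a warning."""
    warnings.warn("Missing CF-netCDF measure variable 'areacella'", UserWarning)
    return "loaded"


filters = [
    {"message": "Missing CF-netCDF measure variable .*", "category": UserWarning},
]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    with ignore_listed_warnings(filters):
        result = noisy_load()  # warning suppressed inside the context
    n_inside = len(caught)
    noisy_load()  # same warning is recorded once the context has exited
    n_outside = len(caught)
```

Each dictionary maps to exactly one `warnings.filterwarnings` call, which matches the "each list element corresponds to one call" wording in the docstrings.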