Create "subset-by-point" capability #105

Closed
5 tasks done
agstephens opened this issue Sep 8, 2021 · 13 comments

@agstephens
Contributor

agstephens commented Sep 8, 2021

Create "subset-by-point" capability in the roocs stack. This is split into these issues:

See below for the overview of the plan.

@agstephens
Contributor Author

agstephens commented Sep 8, 2021

Level and temporal subsetting by point in roocs

Requirement

At present, the user can only select times or levels by giving a range: <start>/<end>

We want to add the ability to specify subsetting by-point, such as:

time=2021-01-01,2021-01-11,2021-01-21
level=850,500,100,50

Concerns

This needs to be implemented in roocs-utils, clisops, daops, rook and rooki. It can complement the existing approach that uses a range, but we need to make sure that the parameterization method works for the cases of:

  1. a range: (start, end)
  2. a selection of two values by-point: (point1, point2)

Constraints

This change will apply to: level and time.

This change will not apply to: area !

Implementation

Implementing subset-by-level

The level subsetting can follow this approach:

  • display (C3S form): checkbox per available level
  • value: a list of levels
  • API value: comma-separated string
  • Rules:
    • if no / and no ,: treat as single value
    • if , found: treat as sequence
    • if <start>/<end>: treat as range
    • cannot provide / and , together: raise an exception
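
The rules above could be sketched roughly as follows. This is only an illustration: `parse_level_request` and its `(kind, values)` return shape are hypothetical names, not part of roocs-utils, and the sketch assumes levels are numeric and that a range always has both a start and an end.

```python
def parse_level_request(value):
    """Classify a raw level string as a single value, a sequence or a range.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    value = value.strip()
    if "/" in value and "," in value:
        # The two notations are mutually exclusive
        raise ValueError(f"Cannot mix '/' and ',' in one request: {value!r}")
    if "/" in value:
        parts = value.split("/")
        if len(parts) != 2:
            raise ValueError(f"A range needs exactly one '/': {value!r}")
        start, end = (float(p) for p in parts)
        return "range", (start, end)
    if "," in value:
        return "sequence", tuple(float(v) for v in value.split(","))
    # Neither separator present: a single value
    return "single", (float(value),)
```

Returning the kind alongside the values means a two-element sequence such as `"850,100"` stays distinguishable from the range `"850/100"` further down the stack.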

Implementing subset-by-time

The time subsetting can follow this approach:

  • display (C3S form) - selectors per time component:
    • year: 1991, 1992, ..., 2020
    • month: 01, 02,..., 12
    • day: 01, 02,..., 31
  • constraints:
    • all days (up to 31) available for all months
    • the service will ignore invalid dates and only return the dates that are in the time array
  • value: e.g. year=1991,2001,2010&month=01,02,03&day=10,20,30
  • API implications:
    • Either provide time parameter (as currently implemented) or year, month, day
      • if frequency is monthly, then day is optional/ignored (but month is required)
      • if frequency is annual, then month and day are optional/ignored
      • in both cases above, the service can do:
        • try to work out the frequency from the input data (e.g. look at the first three time steps)
        • do the check against the inferred frequency
    • If time parameter found, ignore year, month, day
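
A minimal sketch of the "ignore invalid dates" behaviour, assuming a standard Gregorian calendar and a time axis made of `datetime.date` objects. Real model data often uses other calendars (e.g. via cftime), where different dates are valid, so this is illustrative only. Both function names are hypothetical.

```python
from datetime import date
from itertools import product


def expand_date_components(years, months, days):
    """Return every valid calendar date built from the requested components."""
    dates = []
    for year, month, day in product(years, months, days):
        try:
            dates.append(date(year, month, day))
        except ValueError:
            # e.g. 30 February: skipped silently rather than reported
            continue
    return dates


def select_dates(time_axis, years, months, days):
    """Keep only the entries of the time axis that match the request."""
    requested = set(expand_date_components(years, months, days))
    return [t for t in time_axis if t in requested]
```

For example, requesting `years=[1991]`, `months=[1, 2]`, `days=[10, 30]` silently drops 30 February and then returns whichever of the remaining dates are actually present in the time axis.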

Exception handling

Unsorted input list

  • if the user makes a request that is unsorted, e.g.: "1000,250,850,10,500"
    • we will sort the request before subsetting, e.g.: "1000,850,500,250,10"
      • ascending or descending?
        • examine the order of the input data array
        • convert the user's request to the order of the input data array

Invalid value provided

  • if user requests an invalid value:
    • raise Exception(tell them which values were incorrect in the request)

Repeated value(s)

  • if user requests a repeated value:
    • remove the duplicate value in the request parameter, and subset as normal
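
The three cases above (unsorted, invalid, repeated) could be handled in one pass, roughly like this. It is a sketch under assumptions: `normalise_level_request` is a hypothetical name, values are matched exactly (no floating-point tolerance), and `ValueError` stands in for whatever exception the stack would actually raise.

```python
def normalise_level_request(requested, available):
    """Deduplicate, validate and re-order requested values to match the data.

    Hypothetical helper; `available` is the coordinate array of the dataset.
    """
    # Drop repeated values while keeping the first occurrence
    unique = list(dict.fromkeys(requested))
    # Report every value that is not in the data, not just the first one
    invalid = [v for v in unique if v not in available]
    if invalid:
        raise ValueError(f"Requested values not found in the data: {invalid}")
    # Follow the order of the data array itself, whether it ascends or descends
    wanted = set(unique)
    return [v for v in available if v in wanted]
```

Iterating over `available` rather than sorting the request avoids having to decide between ascending and descending order: the result always follows the data.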

@agstephens
Contributor Author

@cehbrecht: Please review and let me know if any of the above doesn't make sense. Thanks

@agstephens
Contributor Author

Here are the affected modules/notebooks:

roocs-utils/docs/api.rst:Parameters
roocs-utils/notebooks/examples.ipynb:       "['AreaParameter',\n",
roocs-utils/notebooks/examples.ipynb:       " 'CollectionParameter',\n",
roocs-utils/notebooks/examples.ipynb:       " 'LevelParameter',\n",
roocs-utils/notebooks/examples.ipynb:       " 'TimeParameter',\n",
roocs-utils/notebooks/examples.ipynb:    "Parameters classes are used to parse inputs of collection, area, time and level used as arguments in the subsetting operation"
roocs-utils/notebooks/examples.ipynb:    "area = roocs_utils.AreaParameter(\"0.,49.,10.,65\")\n",
roocs-utils/notebooks/examples.ipynb:    "collection = roocs_utils.CollectionParameter(\"cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga,cmip5.output1.MPI-M.MPI-ESM-LR.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga\")\n",
roocs-utils/notebooks/examples.ipynb:    "level = roocs_utils.LevelParameter((1000.50, 2000.60))\n",
roocs-utils/notebooks/examples.ipynb:    "time = roocs_utils.TimeParameter(\"2085-01-01T12:00:00Z/2120-12-30T12:00:00Z\")\n",
roocs-utils/notebooks/examples.ipynb:    "Parameterise parameterises inputs to instances of parameter classes which allows them to be used throughout roocs."
roocs-utils/roocs_utils/parameter/area_parameter.py:from roocs_utils.parameter.base_parameter import _BaseParameter
roocs-utils/roocs_utils/parameter/area_parameter.py:class AreaParameter(_BaseParameter):
roocs-utils/roocs_utils/parameter/area_parameter.py:                    raise InvalidParameterValue("Area values must be a number")
roocs-utils/roocs_utils/parameter/area_parameter.py:                    raise InvalidParameterValue("Area values must be a number")
roocs-utils/roocs_utils/parameter/base_parameter.py:from roocs_utils.exceptions import InvalidParameterValue
roocs-utils/roocs_utils/parameter/base_parameter.py:from roocs_utils.exceptions import MissingParameterValue
roocs-utils/roocs_utils/parameter/base_parameter.py:class _BaseParameter(object):
roocs-utils/roocs_utils/parameter/collection_parameter.py:from roocs_utils.parameter.base_parameter import _BaseParameter
roocs-utils/roocs_utils/parameter/collection_parameter.py:class CollectionParameter(_BaseParameter):
roocs-utils/roocs_utils/parameter/collection_parameter.py:            raise MissingParameterValue(f"{self.__class__.__name__} must be provided")
roocs-utils/roocs_utils/parameter/dimension_parameter.py:from roocs_utils.exceptions import InvalidParameterValue
roocs-utils/roocs_utils/parameter/dimension_parameter.py:from roocs_utils.parameter.base_parameter import _BaseParameter
roocs-utils/roocs_utils/parameter/dimension_parameter.py:class DimensionParameter(_BaseParameter):
roocs-utils/roocs_utils/parameter/level_parameter.py:from roocs_utils.exceptions import InvalidParameterValue
roocs-utils/roocs_utils/parameter/level_parameter.py:from roocs_utils.parameter.base_parameter import _BaseParameter
roocs-utils/roocs_utils/parameter/level_parameter.py:class LevelParameter(_BaseParameter):
roocs-utils/roocs_utils/parameter/parameterise.py:    Parameterises inputs to instances of parameter classes which allows
roocs-utils/roocs_utils/parameter/parameterise.py:    :return: Parameters as instances of their respective classes.
roocs-utils/roocs_utils/parameter/parameterise.py:        collection = collection_parameter.CollectionParameter(collection)
roocs-utils/roocs_utils/parameter/parameterise.py:    area = area_parameter.AreaParameter(area)
roocs-utils/roocs_utils/parameter/parameterise.py:    time = time_parameter.TimeParameter(time)
roocs-utils/roocs_utils/parameter/parameterise.py:    level = level_parameter.LevelParameter(level)
roocs-utils/roocs_utils/parameter/time_parameter.py:from roocs_utils.exceptions import InvalidParameterValue
roocs-utils/roocs_utils/parameter/time_parameter.py:from roocs_utils.parameter.base_parameter import _BaseParameter
roocs-utils/roocs_utils/parameter/time_parameter.py:class TimeParameter(_BaseParameter):
roocs-utils/roocs_utils/parameter/__init__.py:from .area_parameter import AreaParameter
roocs-utils/roocs_utils/parameter/__init__.py:from .collection_parameter import CollectionParameter
roocs-utils/roocs_utils/parameter/__init__.py:from .level_parameter import LevelParameter
roocs-utils/roocs_utils/parameter/__init__.py:from .time_parameter import TimeParameter
clisops/clisops/ops/average.py:from roocs_utils.parameter.dimension_parameter import DimensionParameter
clisops/clisops/ops/average.py:        dims = DimensionParameter(params.get("dims", None)).tuple
clisops/clisops/ops/average.py:    dims: Optional[Union[Tuple[str], DimensionParameter]] = None,
clisops/clisops/ops/average.py:    dims : Optional[Union[Tuple[str], DimensionParameter]]
clisops/clisops/ops/base_operation.py:        Parameters that are specific to each operation are handled in:
clisops/clisops/ops/regrid.py:        # we use roocs_utils.exceptions.InvalidParameterValue if an input isn't right
clisops/clisops/ops/subset.py:from roocs_utils.parameter.area_parameter import AreaParameter
clisops/clisops/ops/subset.py:from roocs_utils.parameter.level_parameter import LevelParameter
clisops/clisops/ops/subset.py:from roocs_utils.parameter.time_parameter import TimeParameter
clisops/clisops/ops/subset.py:    time: Optional[Union[str, Tuple[str, str], TimeParameter]] = None,
clisops/clisops/ops/subset.py:            AreaParameter,
clisops/clisops/ops/subset.py:            str, Tuple[Union[int, float, str], Union[int, float, str]], LevelParameter
clisops/clisops/ops/subset.py:    time: Optional[Union[str, Tuple[str, str], TimeParameter]] = None,
clisops/clisops/ops/subset.py:            AreaParameter
clisops/clisops/ops/subset.py:    level: Optional[Union[str, Tuple[Union[int, float, str], Union[int, float, str]], LevelParameter]
clisops/notebooks/average_over_dims.ipynb:    "# Parameters\n",
clisops/notebooks/average_over_dims.ipynb:    "Parameters taken by the `average_over_dims` are below:\n",
clisops/notebooks/average_over_dims.ipynb:    "    dims : Optional[Union[Tuple[str], DimensionParameter]]\n",
clisops/notebooks/average_over_dims.ipynb:    "from roocs_utils.exceptions import InvalidParameterValue\n",
clisops/notebooks/average_over_dims.ipynb:    "except InvalidParameterValue as exc:\n",
clisops/notebooks/average_over_dims.ipynb:    "except InvalidParameterValue as exc:\n",
clisops/notebooks/subset.ipynb:    "Parameters\n",
daops/daops/catalog/util.py:    start, end = time_parameter.TimeParameter(time).tuple
daops/daops/ops/average.py:        dims = dimension_parameter.DimensionParameter(params.get("dims"))
daops/daops/ops/average.py:        collection = collection_parameter.CollectionParameter(collection)
daops/daops/ops/base.py:        self.collection = collection_parameter.CollectionParameter(collection)
daops/daops/ops/regrid.py:        collection = collection_parameter.CollectionParameter(collection)
daops/daops/utils/consolidate.py:    :param collection: (roocs_utils.CollectionParameter) The collection of datasets to process.
rook/rook/director/alignment.py:            start, end = time_parameter.TimeParameter(time).tuple
rook/rook/processes/wps_average.py:        # from roocs_utils.exceptions import InvalidParameterValue, MissingParameterValue
rook/rook/processes/wps_subset.py:        # from roocs_utils.exceptions import InvalidParameterValue, MissingParameterValue

@agstephens
Contributor Author

agstephens commented Sep 9, 2021

Maybe we need something like:

class RangeSelector:
    """
    A simple class for handling a range selection of any type.
    It holds a `start` and `end` but does not try to resolve
    the range, it is just a container to be used by other tools.
    The contents can be of any type, such as datetimes, strings etc.
    """

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def tuple(self):
        return (self.start, self.end)

Or maybe it could just be a property of the Parameter class, i.e. parameter_type: "range" or "sequence"

@cehbrecht

cehbrecht commented Sep 9, 2021

Looking at ISO-8601 standard for date/time and intervals. Available libraries:

The extension of ISO-8601 also supports "seasons" using special month values, like 21 for spring season:
https://www.loc.gov/standards/datetime/iso-tc154-wg5_n0039_iso_wd_8601-2_2016-02-16.pdf

The official ISO-8601 spec is unfortunately not public.

The ISO approach does not seem to fit for us.

I would like to keep only the time parameter. Though it is not concise we could have a list of values for seasons and special days ... needs to be generated by C3S:

time = 1970-01, 1970-02, 1970-03, 1971-01, 1971-02, 1971-03

Or using a special notation?:

time = 1970{01, 02, 03},1971{01, 02, 03}

@agstephens
Contributor Author

More thoughts about selecting times from year, month etc...

>>> year_idxs = ds.groupby('time.year').groups
>>> month_idxs = ds.groupby('time.month').groups
>>> years = (1990, 1995)
>>> months = (1, 2, 3)
>>> time_indexes = set([idx for year in years for idx in year_idxs[year]]).intersection(set([idx for month in months for idx in month_idxs[month]]))

>>> # A set has no guaranteed order, so sort the indexes before selecting
>>> ds.isel(time=sorted(time_indexes)).time
<xarray.DataArray 'time' (time: 6)>
array([cftime.DatetimeGregorian(1990, 1, 16, 12, 0, 0, 0),
       cftime.DatetimeGregorian(1990, 2, 15, 0, 0, 0, 0),
       cftime.DatetimeGregorian(1990, 3, 16, 12, 0, 0, 0),
       cftime.DatetimeGregorian(1995, 1, 16, 12, 0, 0, 0),
       cftime.DatetimeGregorian(1995, 2, 15, 0, 0, 0, 0),
       cftime.DatetimeGregorian(1995, 3, 16, 12, 0, 0, 0)], dtype=object)
Coordinates:
  * time     (time) object 1990-01-16 12:00:00 ... 1995-03-16 12:00:00
Attributes:
    bounds:   time_bnds

@agstephens
Contributor Author

I would like to keep only the time parameter. Though it is not concise we could have a list of values for seasons and special days ... needs to be generated by C3S:

time = 1970-01, 1970-02, 1970-03, 1971-01, 1971-02, 1971-03

Or using a special notation?:

time = 1970{01, 02, 03},1971{01, 02, 03}

@cehbrecht Let's discuss this when we chat. I had assumed we would extend the interface to support "year", "month" and "day" as parameters - but maybe we can push them all through "time" (in some clever way).

@agstephens
Contributor Author

agstephens commented Sep 10, 2021

Ideas:

time=year:1999,2000,2001+month:01,02,03+day:01,02,03
time=;year=1999,2000,2001;month:01,02,02;day=01,02,03; 
time=1999-01-01T00/2100-12-30T23;month:01;day=01;hour=00;

What about if you want to select:
time = 1999-04-01, 2008-09-09, 2111-03-12 ???

level="850/100"  OR  "850,100"   OR   "850"   OR   ""

This might work for time, by separating out "year", "month" etc in time_components:

time= 1999-01-01T00 / 2100-12-30T23 (range)
or
time= 1999-04-01, 2008-09-09, 2111-03-12 (sequence)

and:

time_components=month=jan,feb+day=01,02,03+hour=00,12
time_components=month=01,02|day=01,02,03|hour=00,12

What should our separator be? + or | are good options.

@cehbrecht

cehbrecht commented Sep 13, 2021

Another iteration on the time parameter.

For time it is clear ... it would more or less follow the ISO 8601 time format:

# range 
time = 1970-01-01/1990-12-31
time = 1970/
time = /1990-12
time = 1970-01-01T12:00:00/1990-12-31T12:00:00

# time points
time = 1970-01-01, 1971-01-01, 1972-01-01
time = 1970-01-01T06, 1970-01-01T18
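
A rough sketch of how the range and point forms above might be told apart, including the open-ended ranges `1970/` and `/1990-12`. The function name and return shape are hypothetical, and the values are kept as strings: no attempt is made to validate them as dates.

```python
def parse_time_request(value):
    """Split a raw time string into an interval or a series of time points.

    Hypothetical helper; a missing start or end of a range becomes None.
    """
    value = value.strip()
    if "/" in value and "," in value:
        raise ValueError(f"Cannot mix '/' and ',' in one request: {value!r}")
    if "/" in value:
        start, _, end = value.partition("/")
        if "/" in end:
            raise ValueError(f"A range needs exactly one '/': {value!r}")
        # An empty side means the range is open-ended in that direction
        return "interval", (start.strip() or None, end.strip() or None)
    points = tuple(v.strip() for v in value.split(",") if v.strip())
    return "series", points
```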

For time divisions it is a bit tricky. A name for the parameter could be:

  • time_components
  • time_division
  • time_rule
  • time_constraints

The preferred syntax would be like this:

time_division = years=1970,1980;months=01,02,03

OR

time_rule = years=1970,1980;months=01,02,03

When using WPS GET requests we have a conflict with the WPS syntax, because ; and = are already used as separators inside datainputs:
http://geoprocessing.info/wpsdoc/1x0ExecuteGETEncoding

We only use GET requests for testing. It will work with POST requests, which Rooki uses by default.

To get around this issue we can use double-encoding of the time_division parameter:

In [1]: import urllib.parse

In [2]: urllib.parse.urlencode([('time_division', 'years=1970,1980;months=01,02,03')])
Out[2]: 'time_division=years%3D1970%2C1980%3Bmonths%3D01%2C02%2C03'

In [3]: urllib.parse.urlencode([('time_division', 'years%3D1970%2C1980%3Bmonths%3D01%2C02%2C03')])
Out[3]: 'time_division=years%253D1970%252C1980%253Bmonths%253D01%252C02%252C03'

http://localhost:5000/wps?service=wps&version=1.0.0&request=execute&lineage=true&identifier=subset&datainputs=collection=cmip6.test;time_division=years%253D1970%252C1980%253Bmonths%253D01%252C02%252C03

The value is correctly decoded by pywps on the server side.

We can add a helper function to pywps or rook to provide the double-encoding for our tests.
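
Such a helper could be as small as the sketch below. `double_encode` is a hypothetical name, and this is not an existing pywps or rook function. It uses `urllib.parse.quote` with `safe=""` so that every reserved character is escaped on both passes.

```python
from urllib.parse import quote, unquote


def double_encode(value):
    """Percent-encode a value twice so it survives WPS GET parsing.

    Hypothetical test helper: the first pass escapes '=', ',' and ';',
    the second pass escapes the '%' signs produced by the first.
    """
    return quote(quote(value, safe=""), safe="")


encoded = double_encode("years=1970,1980;months=01,02,03")
# Decoding twice recovers the original value
original = unquote(unquote(encoded))
```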

If we want to avoid double-encoding but keep the "intuitive" syntax we need to add more wps parameters:

time = 1900/1990
time_division_months = 01,02,03
time_division_days = 01

OR shorter

time = 1900/1990
M = 01,02,03
D = 01

@agstephens
Contributor Author

I'm still quite fond of:

time_components=month:01,02|day:01,02,03|hour:00,12
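
For what it's worth, parsing that syntax is straightforward. A minimal sketch, assuming purely numeric values (month names such as "jan" are not handled here) and a hypothetical function name:

```python
def parse_time_components(value):
    """Parse 'month:01,02|day:01,02,03|hour:00,12' into a dict of int tuples.

    Hypothetical helper; '|' separates components and ':' separates a
    component name from its comma-separated values.
    """
    components = {}
    for part in value.split("|"):
        name, _, raw = part.partition(":")
        name = name.strip()
        if not name or not raw.strip():
            raise ValueError(f"Malformed time component: {part!r}")
        components[name] = tuple(int(v) for v in raw.split(","))
    return components
```

Neither `|` nor `:` clashes with the `;` and `=` separators used in WPS GET datainputs, although `|` would still need ordinary URL-encoding in a query string.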

@agstephens
Contributor Author

Our final decision was:

  1. For time and level selections, you cannot just send a tuple of two values because the code doesn't know if you mean an interval or a series of values.
    • solution: wrap it in a constructor: time_series(...) or time_interval(...) - or the same for level.
    • implications: this goes all the way into roocs-utils and affects most unit tests in parts of the stack.
  2. For time component selections:
    • in rook: time_components=month:01,02|day:01,02,03|hour:00,12
    • in clisops/daops: optionally use the constructor: time_components(month=(1, 2), day=(1, 2, 3), hour=(0, 12)) - which can also translate month names/abbreviations into numbers.
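
To illustrate the decision, here is one way the constructors might look. This is a sketch only: the constructor names come from the decision above, but the `Interval` and `Series` containers, the month lookup table and all behaviour shown are assumptions, not the roocs-utils implementation.

```python
from dataclasses import dataclass

# Illustrative lookup for month names and three-letter abbreviations
MONTH_NUMBERS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


@dataclass(frozen=True)
class Interval:
    start: object = None
    end: object = None


@dataclass(frozen=True)
class Series:
    values: tuple


def time_interval(start=None, end=None):
    return Interval(start, end)


def time_series(*values):
    return Series(tuple(values))


def _month_number(value):
    # Accept 1, "01", "jan" or "January"
    if isinstance(value, str) and not value.strip().isdigit():
        return MONTH_NUMBERS[value.strip().lower()[:3]]
    return int(value)


def time_components(**components):
    """Normalise keyword components, translating month names into numbers."""
    result = {}
    for name, values in components.items():
        if not isinstance(values, (tuple, list)):
            values = (values,)
        if name == "month":
            values = [_month_number(v) for v in values]
        result[name] = tuple(int(v) for v in values)
    return result
```

With distinct container types, `time_interval("1999-01-01", "2100-12-30")` and `time_series("1999-01-01", "2100-12-30")` can no longer be confused, which is the ambiguity that a bare two-element tuple has.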

@agstephens
Contributor Author

@huard @Zeitsperre @aulemahal @tlogan2000: I am just tagging you all regarding updates we are making to the roocs stack. These will affect the dependency roocs-utils and changes are made in clisops.core.subset and clisops.ops.subset.

However, once all in place, you shouldn't see any changes to the clisops.core code that you depend on.

The summary of changes is in the cell above (#105 (comment)).

@agstephens
Contributor Author

Could we use a Python slice for an interval?
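
A quick sketch of what that might look like. The built-in `slice` accepts arbitrary objects for `start` and `stop`, so it can carry an interval while a plain tuple or list carries a series. `describe_selection` is a hypothetical name used only for illustration.

```python
def describe_selection(selection):
    """Tell an interval (slice) apart from a series (tuple or list).

    Hypothetical helper; slice(None, end) or slice(start, None) would
    express an open-ended interval.
    """
    if isinstance(selection, slice):
        return "interval", (selection.start, selection.stop)
    return "series", tuple(selection)


interval = describe_selection(slice("1999-01-01", "2100-12-30"))
series = describe_selection(("1999-01-01", "2100-12-30"))
```

One point in its favour is that xarray's `.sel()` already accepts slices for label-based ranges, so an interval held as a slice could be passed through largely unchanged.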
