Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Polygon subsetting #285

Merged
merged 58 commits into from
Dec 18, 2019
Merged

Polygon subsetting #285

merged 58 commits into from
Dec 18, 2019

Conversation

Zeitsperre
Copy link
Collaborator

@Zeitsperre Zeitsperre commented Nov 1, 2019

Pull Request Checklist:

  • This PR addresses an already opened issue (for bug fixes / features)
  • Tests for the changes have been added (for bug fixes / features)
  • Docs have been added / updated (for bug fixes / features)
  • HISTORY.rst has been updated

After merging to master and before closing the branch:

  • bumpversion (minor / major / patch) has been called on master branch
  • Tags have been pushed
  • What kind of change does this PR introduce?
    This adds the subset.subset_shape() function which allows polygon subsetting using a GeoJSON, Shapefile, single-layer GeoPackage, or any other fiona readable formats. The function can be called as follows:
    ds_subset = xclim.subset.subset_shape(ds, shape=geo_file)

  • Does this PR introduce a breaking change?
    No*. Because of the need to project NetCDF files in order for them to be understoud by tools like rasterio, netCDFs that typically don't have CRS information need to be converted via rioxarray. This requires users to have loaded their NetCDF with the rioxarray library installed. Examples can be found in the docstring for the function.

  • Other information:
    rioxarray opens up new opportunities for features, such as writing DataArrays directly to GeoTIFF format and other GIS functions. This demands more investigation.

@Zeitsperre Zeitsperre added the enhancement New feature or request label Nov 1, 2019
@Zeitsperre Zeitsperre added this to the v0.12 milestone Nov 1, 2019
@Zeitsperre Zeitsperre self-assigned this Nov 1, 2019
@Zeitsperre Zeitsperre requested a review from huard November 1, 2019 21:51
@coveralls
Copy link

coveralls commented Nov 1, 2019

Pull Request Test Coverage Report for Build 1354

  • 118 of 166 (71.08%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-1.8%) to 89.54%

Changes Missing Coverage Covered Lines Changed/Added Lines %
xclim/subset.py 118 166 71.08%
Totals Coverage Status
Change from base Build 1319: -1.8%
Covered Lines: 5136
Relevant Lines: 5736

💛 - Coveralls

Copy link
Collaborator

@huard huard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a fan of the decorator switching shape to geometry. I think it obfuscate what's going on.

Add a small test.
Add a notebook example.

xclim/subset.py Show resolved Hide resolved
xclim/subset.py Show resolved Hide resolved
xclim/subset.py Outdated Show resolved Hide resolved
xclim/subset.py Outdated Show resolved Hide resolved
xclim/subset.py Outdated Show resolved Hide resolved
@huard
Copy link
Collaborator

huard commented Nov 4, 2019

FYI, this is how ESMValTool does it: https://github.com/ESMValGroup/ESMValCore/blob/92ba9538b51eec92db578c3cb92eb343d9a93e24/esmvalcore/preprocessor/_area.py#L313

How does it compare to your approach ?

@Zeitsperre
Copy link
Collaborator Author

@huard No comparison. Much of the work I do is to ensure that the data is well understoud by rioxarray. I'm operating under the assumption that that extension is optimized for cutting data.

From my interactions with the maintainer of rioxarray, they seem keen to see contributors suggest changes. I can possibly contribute some of this handling to that project to help simplify the logic here. One gain I have in my approach is that it handles multi-dimensional data fairly well while the approach you linked, from what I can tell only handles 2-dimensional data.

@tlogan2000
Copy link
Collaborator

FYI Failed tests for subset.py seem to be due to the release of cftime>=1.0.4. released 10/21/2019. https://github.com/Unidata/cftime

XARRAY devs seems to be on it as the build versus their master seem to pass.

@huard
Copy link
Collaborator

huard commented Nov 4, 2019

@Zeitsperre I'm simply wary of

  1. leveraging new emerging libraries with few contributors,
  2. adding dependencies.

I'm not against it, but it should be the result of an exercise of pros vs cons, especially if you feel we'll need to invest time in upstream development.

On the minus side, I see for example that it breaks on Python3.5, which I think we're still running internally, right ?

On the plus side, we could replace our subset_bbox by the rioxarray implementation, and I see other valuable features that we could exploit.

I'd be surprised ESMValTool's subset only applied to 2D fields. What makes you think that ?

@Zeitsperre
Copy link
Collaborator Author

@huard I'm with you on this.

Fairly disappointed about the Python3.5 non-support. PyPI reported that it was 3.5-compatible so I figured the failures were due to my code. This may be a bust for the time being. Regrettably, I can't port this to miranda either as I'm trying to keep it Python3.5-compatible as well. It might be best to put a pin in this until a later time (or make a few functions exclusive to 3.6+, an idea I really don't like).

I was under that impression ESMValTools doesn't support it based on the logic of some of the checks. I should give it another look. The other method out there is to leverage GeoPandas and Shapely to make a virtual grid that's used to perform basic map algebra. It's currently not clean but I can try that approach next. Too bad.

@huard
Copy link
Collaborator

huard commented Nov 4, 2019

xarray is dropping support for 3.5, so we'll have to drop it too.

Basically, I'd like you to take stock of what our options are, and suggest a solution that sounds solid from a development and maintenance point of view.

@Zeitsperre
Copy link
Collaborator Author

Zeitsperre commented Nov 5, 2019

@huard As I mentioned briefly in person to @tlogan2000 and yourself, here's a game plan:

  • Dropping Python3.5 can happen eventually. Xarray isn't ensuring builds using 3.5 are passing so eventually, this will be the case here. Let's open a ticket and set the next minor version goals to drop 3.5 as well.

  • The rioxarray is a neat library and approach to this problem, so rather than erase the work I've done, I will be porting the subsetting code to https://github.com/Ouranosinc/miranda as I have already used it in conjunction with that library for other projects. The code for manually warping the lats and lons of the NetCDF is something that the rioxarray maintainer is interested in that I'll try contributing in my free time. Once that project has matured a little, stabilized its API, or been integrated into xarray more formally, we can integrate it here.

  • I'm going to explore the approach for subsetting polygons using GeoPandas. rather than modify the grids of the NetCDFs, reprojecting and manually converting geocoordinates of GeoPandas means we aren't manipulating the NetCDFs as much, which is probably better for a version 1 of the function. We can also leverage GeoPandas for other GIS-like functions and optimize the subset_bbox function at the same time. It's heavier on dependencies but we know it works.

I should be able to start on this later today or tomorrow.

@huard huard added the priority Immediate priority label Nov 18, 2019
@Zeitsperre
Copy link
Collaborator Author

@huard Most of it seems to be working. There's a persistent error on the Python3.8 build that makes me think that rasterio and xarray are not getting along. It seems to be an HDF5 file access error. Have you run into this in the past?

@tlogan2000
Copy link
Collaborator

@huard Most of it seems to be working. There's a persistent error on the Python3.8 build that makes me think that rasterio and xarray are not getting along. It seems to be an HDF5 file access error. Have you run into this in the past?

Do we still need rasterio for this PR?

@RondeauG
Copy link
Contributor

@huard Most of it seems to be working. There's a persistent error on the Python3.8 build that makes me think that rasterio and xarray are not getting along. It seems to be an HDF5 file access error. Have you run into this in the past?

What's the error, exactly? I haven't tested your new version of the polygon subsetting, but last week I kept having the following error when using it alongside engine='h5netcdf' and dask (seemed to work fine without dask, though): ValueError: cannot set WRITEABLE flag to True of this array

Googling this error points towards an incompatibility with pandas (or possibly geopandas) and hdf5: pandas-dev/pandas#24839 and https://stackoverflow.com/questions/54210073/pd-read-hdf-throws-cannot-set-writable-flag-to-true-of-this-array

@Zeitsperre
Copy link
Collaborator Author

Okay, I want to say it's 98% there. We just need to figure out how to install libspatialindex-dev on the Windows build and I want to say we can push this. @huard What say you?

@Zeitsperre
Copy link
Collaborator Author

@huard Will merge this as soon as Travis CI clears. Windows build is still not working, but that's not due to the library, but the trickiness of its dependencies. Will push a new version with that and then we can bump the xarray version and all the dependencies. Work for you?

@Zeitsperre Zeitsperre merged commit efc530c into master Dec 18, 2019
@Zeitsperre Zeitsperre deleted the polygon_subsetting branch December 18, 2019 15:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request priority Immediate priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants