Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Fix DatashaderRasterizer for GeoDataFrame wrapped in StreamWrapper #104

Merged
merged 3 commits into from
May 30, 2023

Conversation

weiji14
Copy link
Owner

@weiji14 weiji14 commented May 30, 2023

In DatashaderRasterizer (from zen3geo v0.3.0 to v0.6.0), the conversion of a geopandas.GeoDataFrame to spatialpandas.GeoDataFrame happens in this try-except statement:

# Convert vector to spatialpandas format to allow datashader's
# rasterization methods to work
try:
columns = ["geometry"] if not hasattr(vector, "columns") else None
_vector = spatialpandas.GeoDataFrame(data=vector, columns=columns)
except ValueError as e:
raise NotImplementedError(
f"Unsupported geometry type(s) {set(vector.geom_type)} detected, "
"only point, line or polygon vector geometry types are supported."
) from e

This was added at commit 6805418 in #35. In some rare cases, the conversion fails when the vector geometry is wrapped in a StreamWrapper class (but not always?), due to spatialpandas not being able to detect the presence of a geometry column. The traceback looks like this:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/mambaforge/envs/mlpipeline/lib/python3.10/site-packages/zen3geo/datapipes/datashader.py:218, in DatashaderRasterizerIterDataPipe.__iter__(self)
    217     columns = ["geometry"] if not hasattr(vector, "columns") else None
--> 218     _vector = spatialpandas.GeoDataFrame(data=vector, columns=columns)
    219 except ValueError as e:

File ~/mambaforge/envs/mlpipeline/lib/python3.10/site-packages/spatialpandas/geodataframe.py:35, in GeoDataFrame.__init__(self, data, index, geometry, **kwargs)
     34 if first_geometry_col is None:
---> 35     raise ValueError(
     36         "A spatialpandas GeoDataFrame must contain at least one spatialpandas "
     37         "GeometryArray column"
     38     )
     40 if geometry is None:

ValueError: A spatialpandas GeoDataFrame must contain at least one spatialpandas GeometryArray column

The above exception was the direct cause of the following exception:

NotImplementedError                       Traceback (most recent call last)
Cell In[54], line 2
      1 it = iter(dp_datashader)
----> 2 tmp = next(it)
      4 torchdata.datapipes.utils.to_graph(dp=dp_datashader)

File ~/mambaforge/envs/mlpipeline/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py:173, in hook_iterator.<locals>.wrap_generator(*args, **kwargs)
    171         response = gen.send(None)
    172 else:
--> 173     response = gen.send(None)
    175 while True:
    176     datapipe._number_of_samples_yielded += 1

File ~/mambaforge/envs/mlpipeline/lib/python3.10/site-packages/zen3geo/datapipes/datashader.py:220, in DatashaderRasterizerIterDataPipe.__iter__(self)
    218     _vector = spatialpandas.GeoDataFrame(data=vector, columns=columns)
    219 except ValueError as e:
--> 220     raise NotImplementedError(
    221         f"Unsupported geometry type(s) {set(vector.geom_type)} detected, "
    222         "only point, line or polygon vector geometry types are supported."
    223     ) from e
    225 # Determine geometry type to know which rasterization method to use
    226 vector_dtype: spatialpandas.geometry.GeometryDtype = _vector.geometry.dtype

NotImplementedError: Unsupported geometry type(s) {'Polygon'} detected, only point, line or polygon vector geometry types are supported.
This exception is thrown by __iter__ of DatashaderRasterizerIterDataPipe(agg=None, kwargs={}, source_datapipe=XarrayCanvasIterDataPipe, vector_datapipe=PyogrioReaderIterDataPipe)

Relevant logic in spatialpandas is at https://github.com/holoviz/spatialpandas/blob/aeeba42751fd7cc16fd7e4a142b053fb9cd0c033/spatialpandas/geodataframe.py#L26-L32

Since spatialpandas doesn't use ducktyping, fixing this would require the vector variable to be serialized from a StreamWrapped instance to a geopandas.GeoDataFrame or geopandas.GeoSeries object. This could happen by:

  1. Calling vector.geometry which returns a geopandas.GeoSeries, and pass that into spatialpandas.GeoDataFrame.
  2. Calling vector.loc[:] to return a view of the geopandas.GeoDataFrame or geopandas.GeoSeries object.

The bugfix in this PR will use (1), basically following the original logic prior to 6805418 in #35. Disadvantage is that the columns from the GeoDataFrame will be lost on converting to GeoSeries, but we are not making use of any other column beside the geometry column on the rasterization step anyway (though it is usually nice to not delete data until necessary).

Also, this PR makes the try-except statement catch a more specific ValueError (since the 'Polygon' dtype should be supported). Ideally there would be a unit test to cover the ValueError: A spatialpandas GeoDataFrame must contain at least one spatialpandas GeometryArray column case, but it's hard to create a minimal reproducible example, so just testing something that hits the same code line instead.

Patches #35.

On converting a vector geometry in a geopandas.GeoDataFrame (which could be wrapped in StreamWrapper) to a spatialpandas.GeoDataFrame, there could be several different types of `ValueError`s raised. This modifies the exception raising to target only the one specific ValueError caused by invalid geometry type. See logic at https://github.com/holoviz/spatialpandas/blame/v0.4.8/spatialpandas/geometry/base.py#L805-L849 for how the original ValueError is raised. Also clarified that MultiPoint, MultiLineString and MultiPolygon geometry types are supported.
Probably wanted to preserve all the columns when converting from geopandas.GeoDataFrame to spatialpandas.GeoDataFrame, but it doesn't work sometimes when the vector is wrapped by StreamWrapper. Decided to pass the vector.geometry GeoSeries as input instead (alternative was to do a view like vector.loc[:]). Partially reverts 6805418 in #35.

Wanted to add a unit test, but it was hard to get a minimal reproducible example. Only know that this helps with a complicated data pipeline reading vector GeoJSON data from a HTTP request.
@weiji14 weiji14 added the bug Something isn't working label May 30, 2023
@weiji14 weiji14 added this to the 0.6.x milestone May 30, 2023
@weiji14 weiji14 self-assigned this May 30, 2023
@weiji14 weiji14 marked this pull request as ready for review May 30, 2023 01:55
Test to ensure that the ValueError raised when an invalid geopandas.GeoDataFrame is passed into DatashaderRasterizer is not about unsupported geometry type, but something else instead. Not exactly a perfect regression test for #104, but it does help with code coverage.
@weiji14 weiji14 merged commit ec7ca05 into main May 30, 2023
@weiji14 weiji14 deleted the spatialpandas/geodataframe_from_wrapped_vector branch May 30, 2023 02:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant