geopandas: Mapping int/int64 to int32 for OGR_GMT format #2592

weiji14 · 2023-07-04T02:43:33Z

Need to add a unit test to pygmt/tests/test_geopandas.py that uses int/int64 columns. Probably a good idea to check uint/uint64 too.

Probably a good idea to check uint/uint64 too.

See line https://github.com/geopandas/geopandas/blob/35463adda663dfa57c9809865ebbb15448823278/geopandas/io/file.py#L640

out_type = type(np.zeros(1, in_type).item()).__name__

For any integer-like data types (like int, int32, int64, unit, uint64 and more), the out_type is always int.

Ah ok, then a unit test with just int64 columns should be fine, no need for parametrized tests.

@weiji14 I have never used geopandas and I find it challenging for me to add a geopandas test for this PR. Could you please help with it?

Getting a little late for me, but let me post the test example I'm working on:

import geopandas as gpd import pandas as pd from pygmt import Figure, info, which, makecpt shapefile = which( fname="@RidgeTest.shp @RidgeTest.shx @RidgeTest.dbf @RidgeTest.prj", download="c", ) gdf = gpd.read_file(shapefile[0]) gdf["geometry"] = gdf.to_crs(crs="EPSG:3857").buffer(distance=100000) gdf["NPOINTS"] = gdf.NPOINTS.astype(dtype="int32") gdf.plot() fig = Figure() makecpt(cmap="lajolla", series=[10, 60, 10], continuous=True) fig.plot( data=gdf[["NPOINTS", "geometry"]], projection="X5c", frame=True, pen="1p,black", close=True, fill="+z", cmap=True, aspatial="Z=NPOINTS", ) fig.colorbar() fig.show()

Couldn't get it to plot though, getting some weird errors like:

plot [ERROR]: Bad OGR/GMT: @n not allowed before FEATURE_DATA plot [ERROR]: Aspatial associations set with -a but input file is not in OGR/GMT format!

Might be another bug? I'll try to come up with an alternative example tomorrow.

pygmt-2m6nvbqu.gmt.txt

This is the temporary OGR/GMT file generated by tempfile_from_geojson (with an extra .txt suffix so that I can upload it to GitHub).

The first few lines are:

# @VGMT1.0 @GPOLYGON # @R-4506555.399/-3052902.89568/3804018.88695/5612191.52675 # @Je3857 # @Jp"+proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=0 +x_0=0 +y_0=0 +k=1 +units=m +nadgrids=@null +wktext +no_defs" # @Jw"PROJCS[\"WGS 84 / Pseudo-Mercator\",GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563,AUTHORITY[\"EPSG\",\"7030\"]],AUTHORITY[\"EPSG\",\"6326\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AUTHORITY[\"EPSG\",\"4326\"]],PROJECTION[\"Mercator_1SP\"],PARAMETER[\"central_meridian\",0],PARAMETER[\"scale_factor\",1],PARAMETER[\"false_easting\",0],PARAMETER[\"false_northing\",0],UNIT[\"metre\",1],AXIS[\"Easting\",EAST],AXIS[\"Northing\",NORTH],EXTENSION[\"PROJ4\",\"+proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=0 +x_0=0 +y_0=0 +k=1 +units=m +nadgrids=@null +wktext +no_defs\"],AUTHORITY[\"EPSG\",\"3857\"]]" # @NNPOINTS # @Tstring # FEATURE_DATA >

Apparently, GMT incorrectly parse the @null string in the @Jp and @Jw lines as an invalid OGR/GMT code. So it looks like an upstream bug.

Edit: opened a bug report in GenericMappingTools/gmt#7712.

Oo, didn't notice that @null. I see at https://proj.org/en/9.2/usage/transformation.html#the-null-grid that you can specify null, by why would there be an @ symbol?!! A little bit of searching in the https://github.com/OSGeo/PROJ repo shows that this might be valid though?

Do we need to report this to GMT, geopandas or PROJ? In the meantime, I can try a projection that doesn't have @null in the PROJ string.

I see at https://proj.org/en/9.2/usage/transformation.html#the-null-grid that you can specify null, by why would there be an @ symbol?!!

I think @ means the grid is optional (see https://github.com/OSGeo/PROJ/blob/0372c81511a11afa76e01ebb30adb00bcb950772/NEWS#L2936). Also see the search results of =@: https://github.com/search?q=repo%3AOSGeo%2FPROJ+%3D%40&type=code

Ok, I've added the unit test at commit c1b3b43, and used a projection that doesn't have @null, feel free to modify if needed.

pygmt-2m6nvbqu.gmt.txt

This is the temporary OGR/GMT file generated by tempfile_from_geojson (with an extra .txt suffix so that I can upload it to GitHub).

The first few lines are:

# @VGMT1.0 @GPOLYGON # @R-4506555.399/-3052902.89568/3804018.88695/5612191.52675 # @Je3857 # @Jp"+proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=0 +x_0=0 +y_0=0 +k=1 +units=m +nadgrids=@null +wktext +no_defs" # @Jw"PROJCS[\"WGS 84 / Pseudo-Mercator\",GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563,AUTHORITY[\"EPSG\",\"7030\"]],AUTHORITY[\"EPSG\",\"6326\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AUTHORITY[\"EPSG\",\"4326\"]],PROJECTION[\"Mercator_1SP\"],PARAMETER[\"central_meridian\",0],PARAMETER[\"scale_factor\",1],PARAMETER[\"false_easting\",0],PARAMETER[\"false_northing\",0],UNIT[\"metre\",1],AXIS[\"Easting\",EAST],AXIS[\"Northing\",NORTH],EXTENSION[\"PROJ4\",\"+proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=0 +x_0=0 +y_0=0 +k=1 +units=m +nadgrids=@null +wktext +no_defs\"],AUTHORITY[\"EPSG\",\"3857\"]]" # @NNPOINTS # @Tstring # FEATURE_DATA >

Apparently, GMT incorrectly parse the @null string in the @Jp and @Jw lines as an invalid OGR/GMT code. So it looks like an upstream bug.

Edit: opened a bug report in GenericMappingTools/gmt#7712.

FYI, the upstream bug has been fixed.

weiji14 · 2023-07-04T02:52:15Z

Hmm, looking into this infer_schema function and geopandas/geopandas#967 (comment) a bit more, I almost think part of the fix should go upstream into modifying infer_schema to output int32 values when the column is of int32 dtype, instead of forcing things to int/int64.

Of course, we'll still need to force the OGR_GMT output to int32 in PyGMT, but the code would be much simplified with some upstream fixes.

Hmm, looking into this infer_schema function and geopandas/geopandas#967 (comment) a bit more, I almost think part of the fix should go upstream into modifying infer_schema to output int32 values when the column is of int32 dtype, instead of forcing things to int/int64.

There is already an open PR geopandas/geopandas#2265, but it has been inactive for one year.

There is already an open PR geopandas/geopandas#2265, but it has been inactive for one year.

Ok, and I saw geopandas/geopandas#2464 as well, also inactive 😆

Started a small PR at geopandas/geopandas#2950 to just handle int32, let's see if their test suite passes, I took a look at https://fiona.readthedocs.io/en/latest/manual.html#keeping-schemas-simple and fiona.FIELD_TYPES_MAP seem to have the following mapping:

{'int32': int, 'float': float, 'str': str, 'date': fiona.rfc3339.FionaDateType, 'time': fiona.rfc3339.FionaTimeType, 'datetime': fiona.rfc3339.FionaDateTimeType, 'bytes': bytes, 'int64': int, 'int': int, 'List[str]': typing.List[str]}

So not sure if the geopandas default driver (ESRI Shapefile?) supports int32 schema. It it doesn't and the test suite breaks, then we might need to handle everything fully in PyGMT.

It's still unclear if there will be any changes in GDAL, GeoPandas, Fiona or GMT, but as we are still using old versions of these packages, I believe we still need these workarounds in the next few years, right?

So the patch in geopandas has been merged at geopandas/geopandas#2950, but will need to wait for geopandas v0.13.3 or newer to be released. Even so, int64 will still need to be converted to int32, so yes, we'll need to keep this workaround for a while.

weiji14 · 2023-07-04T02:39:29Z

If we're going to need to read the schema anyway, should we just use the pure fiona and no geopandas solution I mentioned at #1000 (comment)? Then we can remove the try-except block and the geopandas import?

Sounds a good idea, although I have no experience with __geo_interface__.

If we're going to need to read the schema anyway, should we just use the pure fiona and no geopandas solution I mentioned at #1000 (comment)? Then we can remove the try-except block and the geopandas import?

I think we can't. Currently, we use the gpd.io.file.infer_schema function to infer the data schema, so we still have to import the geopandas package. More importantly, the gpd.io.file.infer_schema function only works for GeoPandas object.

Yeah, let's just stick to using gpd.io.file.infer_schema, I think we'll have to because the patch made at geopandas/geopandas#2950 is in geopandas only and not fiona.

weiji14 · 2023-08-19T04:14:06Z

Only write schema if schema["properties"] is not None. Should fix that error about ValueError: Record does not match collection schema: KeysView(<fiona.model.Properties object at 0x7f3265f03a10>) != []

Suggested change

ogrgmt_kwargs["schema"] = schema

if schema["properties"]:

ogrgmt_kwargs["schema"] = schema

I think it's not the root cause of the error.

Here is a modified version of the test_geopandas_info_geodataframe test to understand why it fails:

import geopandas as gpd import shapely linestring = shapely.geometry.LineString([(20, 15), (30, 15)]) polygon = shapely.geometry.Polygon([(20, 10), (23, 10), (23, 14), (20, 14)]) multipolygon = shapely.geometry.shape( { "type": "MultiPolygon", "coordinates": [ [ [[0, 0], [20, 0], [10, 20], [0, 0]], # Counter-clockwise [[3, 2], [10, 16], [17, 2], [3, 2]], # Clockwise ], [[[6, 4], [14, 4], [10, 12], [6, 4]]], # Counter-clockwise [[[25, 5], [30, 10], [35, 5], [25, 5]]], ], } ) # Multipolygon first so the OGR_GMT file has @GMULTIPOLYGON in the header gdf = gpd.GeoDataFrame( index=["multipolygon", "polygon", "linestring"], geometry=[multipolygon, polygon, linestring], ) schema = gpd.io.file.infer_schema(gdf) print(gdf) print(schema) gdf.to_file("test.gmt", driver="OGR_GMT", mode="w")

The output of the two print statements is:

geometry multipolygon MULTIPOLYGON (((0.00000 0.00000, 20.00000 0.00... polygon POLYGON ((20.00000 10.00000, 23.00000 10.00000... linestring LINESTRING (20.00000 15.00000, 30.00000 15.00000) {'geometry': ['MultiPolygon', 'Polygon', 'LineString'], 'properties': OrderedDict()}

To understand what geopandas/fiona does, I also added the same two print statements to source code geopandas/io/file.py at line https://github.com/geopandas/geopandas/blob/b4b10313ab57bf2c55592a28fb99687c9a538fc2/geopandas/io/file.py#L591.

The two print statements in the geopandas source code give:

index geometry 0 multipolygon MULTIPOLYGON (((0.00000 0.00000, 20.00000 0.00... 1 polygon POLYGON ((20.00000 10.00000, 23.00000 10.00000... 2 linestring LINESTRING (20.00000 15.00000, 30.00000 15.00000) {'geometry': ['MultiPolygon', 'Polygon', 'LineString'], 'properties': OrderedDict([('index', 'str')])}

As you can see, both the gdf object and the schema are different. This is because geopandas by defaults modifies the index column (see lines https://github.com/geopandas/geopandas/blob/b4b10313ab57bf2c55592a28fb99687c9a538fc2/geopandas/io/file.py#L551-L556).

In our case, our inferred schema has a empty property dict OrderedDict(), but geopandas/fiona expects a property dict like OrderedDict([('index', 'str')]).

https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.to_file.html#geopandas.GeoDataFrame.to_file

index: bool, default None

If True, write index into one or more columns (for MultiIndex). Default None writes the index into one or more columns only if the index is named, is a MultiIndex, or has a non-integer data type. If False, no index is written.

I think writing index into one column or not has no effects to the output of OGR_GMT driver, so maybe we should just set index=False.

I think writing index into one column or not has no effects to the output of OGR_GMT driver, so maybe we should just set index=False.

Setting index=True (the default) will produce this in the output *.gmt file:

# @VGMT1.0 # @R0/35/0/20 # @GMULTIPOLYGON # @Nindex # @Tstring # FEATURE_DATA > # @Dmultipolygon # @P 0 0 20 0 10 20 0 0 > # @H 3 2 10 16 17 2 3 2 > # @P 6 4 14 4 10 12 6 4 > # @P 25 5 30 10 35 5 25 5 > # @Dpolygon # @P 20 10 23 10 23 14 20 14 20 10 > # @Dlinestring 20 15 30 15

Notice the lines with the comments # @Nindex and # @Dmultipolygon which are from the index. So the 'index' is output as just a regular column in the OGR_GMT file, and could potentially be used by someone doing aspatial="Z=index". We probably do need to keep it just in case.

Is there another way to handle this? E.g. cast the 'index' column explicitly to a string type?

Is there another way to handle this? E.g. cast the 'index' column explicitly to a string type?

This won't always work, because the index property in a GeoDataFrame object can be integers. For example, in the above example, the object can be defined like

gdf = gpd.GeoDataFrame( #index=["multipolygon", "polygon", "linestring"], index=[0, 1, 2], geometry=[multipolygon, polygon, linestring], )

if index: df = df.reset_index(drop=False)

Probably just need these two lines, call reset_index if an index exists?

https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.to_file.html#geopandas.GeoDataFrame.to_file

index has a default value of None. So df.reset_index is only called if the GeoDataFrame object's index is named or is non-integer.

Ah sorry, I thought index was referring to the df.index attribute. We could do something like this to force an index name:

if gdf.index.name is None: gdf.index.name = "index" # print(gdf.index.names) # FrozenList(['index'])

It looks good.

Ah sorry, I thought index was referring to the df.index attribute. We could do something like this to force an index name:

if gdf.index.name is None: gdf.index.name = "index" # print(gdf.index.names) # FrozenList(['index'])

Patch applied in 2097add.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

geopandas: Mapping int/int64 to int32 for OGR_GMT format #2592

Diff view

Diff view

There are no files selected for viewing

weiji14 Jul 4, 2023

seisman Jul 4, 2023

weiji14 Jul 4, 2023

seisman Aug 9, 2023

weiji14 Aug 9, 2023

seisman Aug 11, 2023 •

edited

Loading

weiji14 Aug 11, 2023

seisman Aug 11, 2023

weiji14 Aug 11, 2023 •

edited

Loading

seisman Aug 14, 2023

weiji14 Jul 4, 2023

seisman Jul 4, 2023 •

edited

Loading

weiji14 Jul 4, 2023

seisman Jul 6, 2023

weiji14 Aug 11, 2023

weiji14 Jul 4, 2023

seisman Jul 4, 2023

seisman Aug 9, 2023

weiji14 Aug 9, 2023 •

edited

Loading

weiji14 Aug 19, 2023

seisman Aug 19, 2023

seisman Aug 19, 2023

weiji14 Aug 19, 2023

seisman Aug 19, 2023

weiji14 Aug 19, 2023 •

edited

Loading

seisman Aug 19, 2023

weiji14 Aug 19, 2023

seisman Aug 19, 2023

seisman Aug 20, 2023 •

edited

Loading

geopandas: Mapping int/int64 to int32 for OGR_GMT format #2592

geopandas: Mapping int/int64 to int32 for OGR_GMT format #2592

Diff view

Diff view

There are no files selected for viewing

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

seisman Aug 11, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

weiji14 Aug 11, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

seisman Jul 4, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

weiji14 Aug 9, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

weiji14 Aug 19, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

seisman Aug 20, 2023 • edited Loading

Choose a reason for hiding this comment

seisman Aug 11, 2023 •

edited

Loading

weiji14 Aug 11, 2023 •

edited

Loading

seisman Jul 4, 2023 •

edited

Loading

weiji14 Aug 9, 2023 •

edited

Loading

weiji14 Aug 19, 2023 •

edited

Loading

seisman Aug 20, 2023 •

edited

Loading