
Add test data covering different native (geoarrow-based) encodings #204

Merged: 9 commits into opengeospatial:main on May 28, 2024

Conversation

@jorisvandenbossche (Collaborator) commented Apr 16, 2024

Addressing part of #199 (small example data)

Similar to #134, but focusing here on the new geoarrow-based encodings (I should also update that older PR).

This adds some tiny example data for the different geometry types:

  • for each geometry type, the data is saved as WKB and as the native encoding (with some default metadata), and additionally also as plain text, in case that would be useful for another tool as the source of truth to compare with (e.g. to check that reading the parquet file was correct)
  • the script only relies on pyarrow for creating the data and writing the parquet files (and on shapely to create the WKB data), so nothing geoarrow-specific (e.g. no Arrow extension types, no interleaved coordinates; just what is in the GeoParquet spec); see the sketch after this list
  • for now I only added 2D data, not yet variants with z coordinates. There are probably also more corner cases that could be added (e.g. a MultiPolygon containing an empty Polygon, and things like that?)
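A minimal sketch of the kind of code this involves (purely illustrative; the filename, metadata values, and version string below are assumptions, not necessarily what the actual script writes):

import json
import pyarrow as pa
import pyarrow.parquet as pq
import shapely

# WKT is the human-readable source of truth, including an empty geometry.
geometries_wkt = ["POINT (30 10)", "POINT EMPTY"]
geoms = [shapely.from_wkt(wkt) for wkt in geometries_wkt]

# WKB encoding: a plain binary column.
wkb_array = pa.array([shapely.to_wkb(g) for g in geoms], type=pa.binary())

# Native point encoding: separated x/y coordinates as a plain struct type
# (no Arrow extension types); an empty point becomes NaN coordinates.
point_type = pa.struct([pa.field("x", pa.float64()), pa.field("y", pa.float64())])
native_array = pa.array([(30.0, 10.0), (float("nan"), float("nan"))], type=point_type)

# Attach "geo" metadata and write the native variant (a WKB variant would use
# wkb_array with encoding "WKB" instead); the version is hardcoded here only
# for illustration.
geo_meta = {
    "version": "1.1.0",
    "primary_column": "geometry",
    "columns": {"geometry": {"encoding": "point", "geometry_types": ["Point"]}},
}
table = pa.table({"geometry": native_array})
table = table.replace_schema_metadata({"geo": json.dumps(geo_meta)})
pq.write_table(table, "data-point-native.parquet")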

In #134 I also saved the metadata in a separate file, which makes it easier to validate / review that it is correct. I could do that here as well, in case that is useful.
(I should still add validation of the generated files with the json schema)

Does this look useful in this form? Any suggestions to approach it differently?

@paleolimbot (Collaborator) left a comment

Just a few notes! (I haven't checked the actual files yet)

Collaborator

These look like .csv files to me...should they have a .csv extension?

geometries_wkt = [
    "POINT (30 10)",
    "POINT EMPTY",
]
Collaborator

Should there also be a version of these that contains NULLs, or a version that contains Z values?

Collaborator Author

Added some null values!

Comment on lines 89 to 90
geometries = pa.array(
    [[(30, 10), (10, 30), (40, 40)], []], type=linestring_type)
Collaborator

Maybe it would be less error-prone to construct these geometries from the WKT? Then we don't have to manually check that the WKT is equivalent to what is here.

Or maybe it would be better to use GeoJSON and pyarrow's JSON reader to infer geometry columns? Though then you'd need to convert from interleaved to separated coords, which is a bit involved for this test data here 😕
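(For what it's worth, for the point case that conversion can be fairly compact with plain pyarrow; this is a hedged sketch, not code from the PR, and nested geometry types would indeed need the more involved offset handling mentioned above.)

import pyarrow as pa
import pyarrow.compute as pc

# Interleaved point coordinates: fixed-size lists of [x, y].
interleaved = pa.array([[30.0, 10.0], [10.0, 30.0]], type=pa.list_(pa.float64(), 2))

# Pull out the x and y values and rebuild them as the separated struct form.
xs = pc.list_element(interleaved, 0)
ys = pc.list_element(interleaved, 1)
separated = pa.StructArray.from_arrays([xs, ys], names=["x", "y"])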

Collaborator Author

Yeah, I actually tried to automate writing this hard-coded data a bit, by parsing the WKT with shapely and converting to GeoJSON. The main problem is that I then need to convert the inner lists to tuples (since JSON only has lists) for the construction of the separated struct type to work with pyarrow (a tuple element can be converted to a struct, but a list element cannot). And that gives some convoluted logic...
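(Roughly the kind of logic in question, sketched here for illustration only; this is not the code from the PR, and the helper name is made up.)

import json
import pyarrow as pa
import shapely

linestring_type = pa.list_(
    pa.struct([pa.field("x", pa.float64()), pa.field("y", pa.float64())])
)

def wkt_to_separated(wkt_strings):
    # WKT -> shapely -> GeoJSON gives plain nested lists for the coordinates.
    coords = [json.loads(shapely.to_geojson(shapely.from_wkt(w)))["coordinates"]
              for w in wkt_strings]
    # pyarrow converts a tuple to a struct value, but not a list, so the
    # innermost [x, y] pairs have to become (x, y) tuples first.
    coords_as_tuples = [[tuple(pt) for pt in line] for line in coords]
    return pa.array(coords_as_tuples, type=linestring_type)

geometries = wkt_to_separated(["LINESTRING (30 10, 10 30, 40 40)", "LINESTRING EMPTY"])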

@jorisvandenbossche (Collaborator Author) commented Apr 17, 2024

cc @rouault would this (have been) useful for testing in GDAL? (xref #189 (comment))
(I have been checking the files generated here with GDAL main, and they read fine)

@rouault (Contributor) commented Apr 17, 2024

would this (have been) useful for testing in GDAL?

yes, thanks

@rouault (Contributor) left a comment

I've detected that there's a typo: geometry_type should be geometry_types. The generated files should be regenerated.
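(For context, the plural form is the key used in the column metadata; an illustrative snippet, not taken from the PR:)

column_metadata = {"encoding": "WKB", "geometry_types": ["Point"]}  # not "geometry_type"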

@jorisvandenbossche (Collaborator Author)

I've detected that there's a typo: geometry_type should be geometry_types. The generated files should be regenerated.

Ah yes, that's because I started from #134, which I wrote before we made that name change... Thanks for the catch!

@cholmes added this to the 1.1 milestone on May 2, 2024
@cholmes (Member) commented May 28, 2024

I'm going to go ahead and merge this in, as it's been open and approved for a while, and it seems like a good addition. Any further work can be done as PRs to main.

@cholmes merged commit dced61c into opengeospatial:main on May 28, 2024
2 checks passed
@jorisvandenbossche deleted the example-data-geoarrow branch on June 3, 2024 at 15:48
@tschaub (Collaborator) commented Jun 21, 2024

@jorisvandenbossche - The scripts directory has a generate_example.py script and a readme that describes how to install the dependencies for it. Could we move this new generate_test_data.py script to that same directory and update the readme if needed to describe how to install the dependencies for it (or update the pyproject.toml or the poetry.lock as needed)?

In addition, it would be nice to pull the version identifier out of the format-specs/schema.json (as is done in the generate_example.py script) to limit the number of places we have to update version numbers for a release.

I'll leave the same comment on #232 (as I think it would be nice to consolidate the examples/test data into one thing that is easier to maintain).
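(A possible shape for that version lookup; a sketch only, assuming format-specs/schema.json pins the version with a const under its properties, which would need to be checked against the actual schema and against what generate_example.py does.)

import json
from pathlib import Path

# Read the spec version from the JSON schema instead of hardcoding it.
# The "properties" -> "version" -> "const" path is an assumption about the schema layout.
schema = json.loads(Path("format-specs/schema.json").read_text())
spec_version = schema["properties"]["version"]["const"]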

@jorisvandenbossche (Collaborator Author)

In addition, it would be nice to pull the version identifier out of the format-specs/schema.json (as is done in the generate_example.py script) to limit the number of places we have to update version numbers for a release.

Yes, I realized last week with the release that I had suboptimally hardcoded the number here in this PR.

Will try to look at it tomorrow.
