PARQUET-2471: Add GEOMETRY and GEOGRAPHY logical types #240
base: master
Conversation
@wgtmac Thanks for the work. On the other hand, I'd like to highlight that GeoParquet (https://github.com/opengeospatial/geoparquet/tree/main) has been there for a while and much geospatial software has started to support reading and writing it. Is the ultimate goal of this PR to merge the GeoParquet spec into Parquet completely, or might it end up creating a new spec that is not compatible with GeoParquet? |
Geo Iceberg does not need to conform to GeoParquet because people should not directly use a parquet reader to read iceberg parquet files anyways. So that's a separate story. |
@jiayuasu That's why I've asked the possibility of direct compliance to the GeoParquet spec in the Iceberg design doc. I don't intend to create a new spec. Instead, it would be good if the proposal here can meet the requirement of both Iceberg and GeoParquet, or share the common stuff to make the conversion between Iceberg Parquet and GeoParquet lightweight. We do need advice from the GeoParquet community to make it possible. |
From the Iceberg side, I am excited about this. I think it will make geospatial interop easier in the long run to define the type formally in parquet-format, and also unlock row group filtering (for example, Iceberg's add_file for Parquet files). Perhaps there can be conversion utils for GeoParquet if we go ahead with this, and I'd definitely like to see what they think.
I'm new on the Parquet side, so I had some questions
@paleolimbot is quite knowledgeable on the topic and could probably give useful feedback. |
I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string. Doing so in Parquet as well would lighten the maintenance workload on the serialization format, and would also allow easier evolution of geometry metadata to support additional information. Edit: this seems to be the approach adopted by GeoParquet as well. |
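As a rough illustration of the JSON-string approach suggested above (the keys below are hypothetical and not part of any spec), the metadata could be serialized and parsed independently of Thrift:

```python
import json

# Hypothetical serialized geometry metadata, mirroring how Arrow canonical
# extension types carry metadata as a standalone JSON string. The keys are
# illustrative only, not part of the Parquet or GeoParquet specs.
metadata = {
    "encoding": "WKB",
    "crs": "OGC:CRS84",
    "edges": "planar",
}

serialized = json.dumps(metadata)  # stored as an opaque string in the file
restored = json.loads(serialized)  # readers evolve independently of Thrift

print(restored["crs"])  # → OGC:CRS84
```

Because the string is opaque to the Thrift schema, new keys could be added later without changing parquet.thrift, which is the maintenance benefit being discussed.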
I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.
In reading this I do wonder if there should just be an extension mechanism here instead of attempting to enumerate all possible encodings in this repo. The people that are engaged and working on implementations are the right people to engage here, which is why GeoParquet and GeoArrow have been successful (we've engaged the people who care about this, and they are generally not paying attention to apache/parquet-format nor apache/arrow).
There are a few things that this PR solves in a way that might not be possible using EXTENSION, chiefly column statistics. It would be nice to have some geo-specific things there (although maybe that can also be part of the extension mechanism). Another thing that comes up frequently is where to put a spatial index (rtree)... I don't think there's any good place for that at the moment.
It would be nice to allow the format to be extended in a way that does not depend on schema-level metadata... the things we do in the GeoParquet standard (store bounding boxes, refer to columns by name) become stale with the ways that schema metadata are typically propagated through projections and concatenations.
@pitrou Yes, that might be an option. Then we can perhaps use the same JSON string defined in the Iceberg doc. @jiayuasu @szehon-ho WDYT? EDIT: I think we can remove those informative attributes like |
I thought this could be something similar to the page index or bloom filter in Parquet, which are stored somewhere between row groups or before the footer. It could be at the row group level or file level as well.
I think we really need your advice here. If you were to rethink the design of GeoParquet, how could it do better if the Parquet format had some geospatial knowledge? @paleolimbot @jiayuasu |
The main reason that the schema-level metadata had to exist is that there was no way to put anything custom at the column level to give geometry-aware readers extra metadata about the column (CRS being the main one) and global column statistics (bbox). Bounding boxes at the feature level (worked around as a separate column) are the second somewhat ugly thing, which gives reasonable row group statistics for many things people might want to store. It seems like this PR would solve most of that. I am not sure that a new logical type will catch on to the extent that GeoParquet will, although I'm new to this community and I may be very wrong. The GeoParquet working group is enthusiastic, and encodings/strategies for storing/querying geospatial datasets in a data lake context are evolving rapidly. Even though it is a tiny bit of a hack, using extra columns and schema-level metadata to encode these things is very flexible and lets implementations be built on top of a number of underlying readers/underlying versions of the Parquet format. |
@paleolimbot I'm happy to see the fast evolution of the GeoParquet specs. I don't think the addition of a geometry type aims to replace or deprecate anything from GeoParquet. Instead, GeoParquet can simply ignore the new type as of now, or leverage the built-in bbox if beneficial. For additional (informative) attributes of the geometry type, if some of them are stable and it makes sense to store them natively in Parquet column metadata, then perhaps we can work together to make it happen? I think the main goal of this addition is to enhance interoperability of geospatial data across systems, while at the same time making it take little effort to convert to GeoParquet. |
The bounding-box based sort order defined for the geometry logical type is already good enough for performing row-level and page-level data skipping. A spatial index such as an R-tree may not be suitable for Parquet. I am aware that flatgeobuf has an optional static packed Hilbert R-tree index, but for the index to be effective, flatgeobuf supports random access of records and does not support compression. The minimal granularity of reading data in Parquet files is the data page, and pages are usually compressed, so it is impossible to randomly access records within pages. |
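The bounding-box data skipping described here can be sketched as a simple intersection test against per-row-group bbox statistics (a toy illustration; the function and type names are hypothetical, not any Parquet implementation's API):

```python
from typing import NamedTuple

class BBox(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float

def intersects(a: BBox, b: BBox) -> bool:
    """True if two axis-aligned bounding boxes overlap."""
    return (a.xmin <= b.xmax and b.xmin <= a.xmax and
            a.ymin <= b.ymax and b.ymin <= a.ymax)

def row_groups_to_read(query: BBox, stats: list[BBox]) -> list[int]:
    """Skip any row group whose bbox statistics cannot match the query."""
    return [i for i, s in enumerate(stats) if intersects(query, s)]

# Hypothetical per-row-group bbox statistics:
stats = [BBox(0, 0, 10, 10), BBox(50, 50, 60, 60), BBox(5, 5, 20, 20)]
print(row_groups_to_read(BBox(8, 8, 12, 12), stats))  # → [0, 2]
```

The same test applies at page granularity via the page index, which is why bbox statistics alone already enable useful spatial pruning without an R-tree.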
I agree! I think first-class geometry support is great and I'm happy to help wherever I can. I see GeoParquet as a way for existing spatial libraries to leverage Parquet and is not well-suited to Parquet-native things like Iceberg (although others working on GeoParquet may have a different view). Extension mechanisms are nice because they allow an external community to hash out the discipline-specific details where these evolve at an orthogonal rate to that of the format (e.g., GeoParquet), which generally results in buy-in. I'm not familiar with the speed at which the changes proposed here can evolve (or how long it generally takes readers to implement them), but if @pitrou's suggestion of encoding the type information or statistics in serialized form makes it easier for this to evolve it could provide some of that benefit.
I also agree here (but it did come up a lot of times in the discussions around GeoParquet). I think developers of Parquet-native workflows are well aware that there are better formats for random access. |
I opened up opengeospatial/geoparquet#222 to collect some thoughts on this...we discussed it at our community call and I think we mostly just never considered that the Parquet standard would be interested in supporting a first-class data type. I've put my thoughts there but I'll let others add their own opinions. |
Just to ensure my understanding is correct:
|
To answer this part myself: at least for the Parquet C++ implementation, it seems an error is raised for unknown logical types, and it doesn't fall back to the physical type. So that does complicate the compatibility story... |
@jorisvandenbossche I think your concern makes sense. It should be considered a bug if parquet-cpp fails due to an unknown logical type, and we need to fix that. I also have a concern about a new |
No one has really worked on the geo integration into Iceberg for some time: apache/iceberg#2586 |
Yes there is now a concrete proposal apache/iceberg#10260 , and the plan currently is to bring it up in next community sync |
Thanks for doing this @wgtmac - it's awesome to see this proposal! I helped initiate GeoParquet, and hope we can fully support your effort.
That makes sense, but I think we're also happy to have GeoParquet replaced! As long as it can 'scale up' to meet all the crazy things that hard core geospatial people need, while also being accessible to everyone else. If Parquet had geospatial types from the start we wouldn't have started GeoParquet. We spent a lot of time and effort trying to get the right balance between making it easy to implement for those who don't care about the complexity of geospatial (edges, coordinate reference systems, epochs, winding), while also having the right options to handle it for those who do. My hope has been that the decisions we made there will make it easier to add geospatial support to any new format - like that a 'geo-ORC' could use the same fields and options that we added.
Sounds great! Happy to have GeoParquet be a place to 'try out' things. But I think ideally the surface area of 'GeoParquet' would be very minimal or not even exist, and that Parquet would just be the ideal format to store geospatial data in. And I think if we can align well between this proposal and GeoParquet that should be possible. |
@wgtmac separate from the spec it might be good to start discussions on what the implementation for GeoParquet might look like (e.g. what new dependencies do we plan on taking on for reference implementation? What would APIs look like?) |
@emkornfield The PoC implementations are apache/arrow#43977 and apache/parquet-java#2971 |
@emkornfield I think this is a good idea! The PoC implementations specifically may not handle writing statistics for non-planar edges, depending on the final call on whether the statistics are always Cartesian min/max (i.e., lying for spherical edges and meant to be ignored), or whether the statistics take curved edges into account for the non-planar case (which requires non-trivial computational effort and complexity on behalf of the writer, but eliminates computational effort and complexity for the reader). Discussions in Iceberg have converged on the latter, which means we may have to figure out how to plug in S2 and/or Boost::Geometry when writing statistics in C++ (I can't speak for Java). Off the top of my head, it could either be a Parquet-specific hook to override stats for a column chunk, the name of an Arrow compute UDF that can compute the required box, or willingness to put Boost or S2 as a dependency in that section of the code. (I don't think that's required for the PoC, personally, but I'm also happy to prototype any of those if somebody does.) |
S2 C++ has an S2LatLngRectBounder class which makes it very easy to compute the rectangular bound of a chain of geodesic edges, taking their curvature into account. |
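A small numeric check (plain Python, not S2) of why geodesic curvature matters for these bounds: the great-circle arc between two vertices at latitude 60° bulges poleward, so a bound computed from the vertices alone would be too small:

```python
import math

def to_xyz(lat_deg: float, lon_deg: float) -> tuple:
    """Unit-sphere Cartesian coordinates from degrees."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def midpoint_lat(p: tuple, q: tuple) -> float:
    """Latitude (degrees) of the great-circle midpoint of two unit vectors."""
    m = tuple(a + b for a, b in zip(p, q))
    norm = math.sqrt(sum(c * c for c in m))
    return math.degrees(math.asin(m[2] / norm))

# Two vertices at latitude 60: a vertex-only bbox would claim max lat = 60,
# but the geodesic edge between them reaches about 67.8 degrees.
lat = midpoint_lat(to_xyz(60, 0), to_xyz(60, 90))
print(round(lat, 1))  # → 67.8
```

This is the kind of poleward bulge that S2LatLngRectBounder accounts for when a writer produces statistics for spherical edges.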
Thank you for updating this! I know we have collectively been back and forth on a number of these concepts but I think what this has reduced to is very good. I took a read for consistency with the Iceberg PR and anything that might conflict when Parquet readers interact with spatial type implementations elsewhere. It looks great!
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
LogicalTypes.md
Outdated
- `crs`: An optional string value for CRS. If unset, the CRS defaults to
"OGC:CRS84", which means that the geometries must be stored in longitude,
latitude based on the WGS84 datum.
- `crs_encoding`: An optional enum value that describes the encoding used by the
In the Iceberg PR, the `crs` and `crs_encoding` are merged into a single `crs` field formatted as `$type:$content`, where `type` allows `srid` and `projjson`. Should we follow the same?
This is the only thing where I am hesitant to stay consistent with the Iceberg PR. The mapping between them follows the table below:

| Iceberg | Parquet |
|---|---|
| `$type` | `crs_encoding` |
| `$content` | `crs` |

The reason is to provide a chance for non-Iceberg use cases to write full PROJJSON content to the `crs` field. But this is not a strong opinion. Let me know what you think.
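As a sketch of how lightweight the conversion between the two representations could be (the function names below are hypothetical; the `$type:$content` format comes from the Iceberg proposal):

```python
def split_iceberg_crs(crs: str) -> tuple:
    """Split an Iceberg-style '$type:$content' CRS string into the
    (crs_encoding, crs) pair proposed for Parquet.
    Only the first ':' separates type from content."""
    type_, _, content = crs.partition(":")
    return type_, content

def join_iceberg_crs(crs_encoding: str, crs: str) -> str:
    """Inverse conversion, back to the Iceberg-style single field."""
    return f"{crs_encoding}:{crs}"

print(split_iceberg_crs("srid:4326"))  # → ('srid', '4326')
```

Since the two forms round-trip mechanically, the choice is mostly about which shape is more convenient for writers that cannot classify the CRS string they are given.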
I am happy with either (they encode equivalent information); however, I think separating them specifically for Parquet is a little nicer. I do like that there is a clear way for a writer to write an arbitrary string if it doesn't have the means to validate that it received PROJJSON (or doesn't have the ability to generate it). I also don't feel strongly about that 🙂
Makes sense! I think the current way in Parquet is better and we can keep it.
I'm strongly against the current type parameters. The CRS should not be embedded, it should be referenced like the spec in Iceberg. There's no need to duplicate these, require support for specific formats, or make type annotations huge. A CRS reference ID is cleaner, along with a way to store the CRS definition in file metadata if that's desired.
I'm neutral conceptually (it's all the same information, it's just a matter of where to put it). The decision to put it as part of the LogicalType struct just makes a cleaner implementation (i.e., no need to add a reference to the global metadata to existing function signatures) to support conversions that need the CRS (almost all of them, as far as I know).
I've changed the `crs` field to be a simple string without restriction. Let me know what you think. @rdblue @paleolimbot
src/main/thrift/parquet.thrift
Outdated
@@ -386,6 +409,61 @@ struct BsonType {
struct VariantType {
}

/** Coordinate reference system (CRS) encoding for Geometry and Geography logical types */
enum CRSEncoding {
Nit: looks like this is not used.
Geospatial.md
Outdated
latitude based on the WGS84 datum.

Custom CRS can be specified by a string value. It is recommended to use the
identifier of the CRS like [Spatial reference identifier][srid] and [PROJJSON][projjson].
Is PROJJSON considered an identifier? I think it may be clearer if the reference to PROJJSON here were moved and clarified elsewhere as a convention for how you might pass a CRS definition as PROJJSON in a table or file property.
I have checked that PROJJSON has identifiers: https://proj.org/en/stable/specifications/projjson.html#identifiers. However, I'm not an expert, so perhaps @jiayuasu @paleolimbot could help answer this?
Yes, PROJJSON optionally embeds an identifier in its JSON structure if the CRS has one (however, some of the data we are trying to convince large organizations/governments to distribute in Parquet don't have an authority/code and some require more than one authority/code to specify the CRS for the x-y separately from the z).
Because we've gone in quite a few circles on this one, my preference is just a string representation of the CRS with no further specification (i.e., writer/reader is responsible for serializing and deserializing the CRS, respectively).
If that isn't acceptable, I would add "writers should write the most compact form of CRS that fully describes the CRS. Identifiers should be used where possible and written in the form authority:code (e.g., `OGC:CRS84` to specify longitude/latitude on the WGS84 ellipsoid)." That definition would result in 99.9% of geometry columns having a compact (but self-contained) CRS definition (authority:code), while also allowing producers to write whatever an upstream library provided them.
Barring either of those being acceptable, I would just make the `projjson:some_schema_metadata_field` language explicit.
Because we've gone in quite a few circles on this one, my preference is just a string representation of the CRS with no further specification (i.e., writer/reader is responsible for serializing and deserializing the CRS, respectively).
I think this is exactly the current goal with the recommendation (not enforcement) of an ID-based CRS.
Thanks for all the work here, @wgtmac and everyone that has helped review and refine this! It looks ready to me and I'm glad to see that it is simpler. I agree with @paleolimbot that what this has now reduced down to looks great. |
Geospatial.md
Outdated
## Bounding Box

A geometry has at least two coordinate dimensions: X and Y for 2D coordinates
Nit: 'geometry' may be misleading as it's both geometry and geography. How about 'A bounding box value has at least...'?
A related question: should we rename the `GeometryStatistics` to `GeospatialStatistics` to avoid confusion?
Geospatial.md
Outdated
The default CRS `OGC:CRS84` means that the objects must be stored in longitude,
latitude based on the WGS84 datum.
It is explicitly mentioned here that this case means lon/lat. But does that imply that for other CRS values the stored coordinates follow the axis order defined by the CRS? And not always as x/y or lon/lat as GeoParquet specifies?
(in either case, I think it would be good to be more specific about this)
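For context on the axis-order question (a toy sketch, not part of the proposal): EPSG:4326 defines its axis order as latitude, longitude, while OGC:CRS84 defines longitude, latitude, so a reader that honors the CRS-defined order would need to swap coordinates for some CRS values:

```python
def normalize_to_lon_lat(coords: list, crs: str) -> list:
    """Return (lon, lat) pairs regardless of the CRS-defined axis order.
    Toy example covering only two well-known CRS identifiers."""
    lat_lon_first = {"EPSG:4326"}  # authority-defined order: latitude, longitude
    if crs in lat_lon_first:
        return [(second, first) for (first, second) in coords]
    return list(coords)  # e.g. OGC:CRS84 already stores lon, lat

print(normalize_to_lon_lat([(48.85, 2.35)], "EPSG:4326"))  # → [(2.35, 48.85)]
```

This swap is the extra work GeoParquet's always-x/y rule avoids, which is why the compatibility question above matters.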
Updated the PR to address various comments:
Let me know what you think. @rdblue @paleolimbot @jorisvandenbossche @szehon-ho @jiayuasu |
Has this change been discussed in the last months? This is a huge break in compatibility with GeoParquet, and I am not sure it is going to be practical, both for a transitional phase (for example, it makes it more difficult to write Parquet files that both use this new geometry type and are still valid GeoParquet files) and for readers/writers long term (AFAIK most libraries/engines that would consume such data would have to swap the coordinates, and additionally always have to inspect the details of the CRS while consuming the WKB). |
There was a discussion with @rdblue @jiayuasu @szehon-ho about not wanting people to assume that X is always longitude. For the default CRS |
Apache Iceberg is adding geospatial support: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI. It would be good if Apache Parquet can support geometry type natively.