PARQUET-2471: Add geometry logical type #240

Open · wants to merge 14 commits into base: master
Conversation

wgtmac
Member

@wgtmac wgtmac commented May 10, 2024

Apache Iceberg is adding geospatial support: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI. It would be good if Apache Parquet could support a geometry type natively.

@jiayuasu
Member

@wgtmac Thanks for the work. On the other hand, I'd like to highlight that GeoParquet (https://github.com/opengeospatial/geoparquet/tree/main) has been there for a while and many geospatial tools have started to support reading and writing it.

Is the ultimate goal of this PR to merge the GeoParquet spec into Parquet completely, or might it end up creating a new spec that is not compatible with GeoParquet?

@jiayuasu
Member

Geo Iceberg does not need to conform to GeoParquet because people should not directly use a Parquet reader to read Iceberg Parquet files anyway. So that's a separate story.

@wgtmac
Member Author

wgtmac commented May 11, 2024

Is the ultimate goal of this PR to merge the GeoParquet spec into Parquet completely, or might it end up creating a new spec that is not compatible with GeoParquet?

@jiayuasu That's why I've asked about the possibility of direct compliance with the GeoParquet spec in the Iceberg design doc. I don't intend to create a new spec. Instead, it would be good if the proposal here can meet the requirements of both Iceberg and GeoParquet, or share the common parts to make the conversion between Iceberg Parquet and GeoParquet lightweight. We do need advice from the GeoParquet community to make it possible.

@szehon-ho szehon-ho left a comment

From the Iceberg side, I am excited about this. I think it will make geospatial interop easier in the long run to define the type formally in parquet-format, and it also unlocks row group filtering (for example, for Iceberg's add_file for Parquet files). Perhaps there can be conversion utils for GeoParquet if we go ahead with this, and I'd definitely like to see what they think.

I'm new on the Parquet side, so I had some questions.

@wgtmac wgtmac marked this pull request as ready for review May 11, 2024 16:13
@wgtmac wgtmac changed the title WIP: Add geometry logical type PARQUET-2471: Add geometry logical type May 11, 2024
@pitrou
Member

pitrou commented May 15, 2024

@paleolimbot is quite knowledgeable on the topic and could probably give useful feedback.

@pitrou
Member

pitrou commented May 15, 2024

I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.

Doing so in Parquet as well would lighten the maintenance workload on the serialization format, and would also allow easier evolution of geometry metadata to support additional information.

Edit: this seems to be the approach adopted by GeoParquet as well.

@paleolimbot paleolimbot (Member) left a comment

I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.

In reading this I do wonder if there should just be an extension mechanism here instead of attempting to enumerate all possible encodings in this repo. The people working on implementations are the right people to engage here, which is why GeoParquet and GeoArrow have been successful (we've engaged the people who care about this, and they are generally not paying attention to apache/parquet-format or apache/arrow).

There are a few things that this PR solves in a way that might not be possible using EXTENSION, chiefly column statistics. It would be nice to have some geo-specific things there (although maybe that can also be part of the extension mechanism). Another thing that comes up frequently is where to put a spatial index (R-tree)...I don't think there's any good place for that at the moment.

It would be nice to allow the format to be extended in a way that does not depend on schema-level metadata: the things we do in the GeoParquet standard (store bounding boxes, refer to columns by name) become stale with the ways that schema metadata are typically propagated through projections and concatenations.

@wgtmac
Member Author

wgtmac commented May 17, 2024

I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.

Doing so in Parquet as well would lighten the maintenance workload on the serialization format, and would also allow easier evolution of geometry metadata to support additional information.

Edit: this seems to be the approach adopted by GeoParquet as well.

@pitrou Yes, that might be an option. Then we can perhaps use the same JSON string defined in the Iceberg doc. @jiayuasu @szehon-ho WDYT?

EDIT: I think we can remove informative attributes like subtype, orientation, and edges. Perhaps encoding can be removed as well if we only support WKB. dimension is something we must be aware of, because building the bbox depends on whether the coordinates are represented as xy, xyz, xym, or xyzm.
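To make the dimension point concrete, here is a minimal sketch (a hypothetical helper, not part of the proposal) showing how the bbox width depends on the declared dimension:

```python
# Sketch: fold coordinate tuples into a bbox whose width depends on the
# declared dimension (xy, xyz, xym, xyzm). Hypothetical helper, not part
# of the proposed Thrift schema.

DIMS = {"xy": 2, "xyz": 3, "xym": 3, "xyzm": 4}

def update_bbox(dimension, coords):
    """Return [min_0, ..., min_n, max_0, ..., max_n] over coords."""
    n = DIMS[dimension]
    mins = [min(c[i] for c in coords) for i in range(n)]
    maxs = [max(c[i] for c in coords) for i in range(n)]
    return mins + maxs

# xy points -> 4-value bbox [xmin, ymin, xmax, ymax]
print(update_bbox("xy", [(1.0, 5.0), (3.0, 2.0)]))   # [1.0, 2.0, 3.0, 5.0]
# xyz points -> 6-value bbox
print(update_bbox("xyz", [(1.0, 5.0, 0.0), (3.0, 2.0, 9.0)]))  # [1.0, 2.0, 0.0, 3.0, 5.0, 9.0]
```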

@wgtmac
Member Author

wgtmac commented May 17, 2024

Another thing that comes up frequently is where to put a spatial index (rtree)

I think this could be something similar to the page index or bloom filter in Parquet, which are stored between the row groups or before the footer. It could be row-group level or file level as well.

It would be nice to allow the format to be extended in a way that does not depend on schema-level metadata.

I think we really need your advice here. If you were to rethink the design of GeoParquet, how could it do better if the Parquet format had some geospatial knowledge? @paleolimbot @jiayuasu

@paleolimbot
Member

If you were to rethink the design of GeoParquet, how could it do better if the Parquet format had some geospatial knowledge?

The main reason that the schema-level metadata had to exist is that there was no way to put anything custom at the column level to give geometry-aware readers extra metadata about the column (CRS being the main one) and global column statistics (bbox). Bounding boxes at the feature level (worked around as a separate column) are the second somewhat ugly thing, which gives reasonable row group statistics for many things people might want to store. It seems like this PR would solve most of that.

I am not sure that a new logical type will catch on to the extent that GeoParquet will, although I'm new to this community and I may be very wrong. The GeoParquet working group is enthusiastic and encodings/strategies for storing/querying geospatial datasets in a data lake context are evolving rapidly. Even though it is a tiny bit of a hack, using extra columns and schema-level metadata to encode these things is very flexible and lets implementations be built on top of a number of underlying readers/underlying versions of the Parquet format.

@wgtmac
Member Author

wgtmac commented May 18, 2024

@paleolimbot I'm happy to see the fast evolution of GeoParquet specs. I don't think the addition of geometry type aims to replace or deprecate something from GeoParquet. Instead, GeoParquet can simply ignore the new type as of now, or leverage the built-in bbox if beneficial. For additional (informative) attributes of the geometry type, if some of them are stable and make sense to store them natively into parquet column metadata, then perhaps we can work together to make it happen? I think the main goal of this addition is to enhance interoperability of geospatial data across systems and at the same time it takes little effort to convert to GeoParquet.

@Kontinuation
Member

Another thing that comes up frequently is where to put a spatial index (rtree)

I thought this can be something similar to the page index or bloom filter in parquet, which are stored somewhere between row groups or before the footer. It can be row group level or file level as well.

The bounding-box based sort order defined for the geometry logical type is already good enough for performing row-level and page-level data skipping. A spatial index such as an R-tree may not be suitable for Parquet. I am aware that FlatGeobuf has an optional static packed Hilbert R-tree index, but for the index to be effective, FlatGeobuf supports random access to records and does not use compression. The minimal granularity of reading data in Parquet files is the data page, and pages are usually compressed, so it is impossible to access records within pages randomly.
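The bbox-based data skipping described above can be sketched as follows (function and field names are illustrative, not the proposed Thrift layout):

```python
# Sketch of bbox-based row-group skipping: a row group can be pruned when
# its bounding box does not intersect the query window. The tuple layout
# (xmin, ymin, xmax, ymax) is illustrative, not the proposed Thrift layout.

def intersects(a, b):
    """a, b: (xmin, ymin, xmax, ymax). True if the boxes overlap."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def row_groups_to_read(row_group_bboxes, query_window):
    """Keep only the row groups whose bbox intersects the query window."""
    return [i for i, bbox in enumerate(row_group_bboxes)
            if intersects(bbox, query_window)]

groups = [(0, 0, 10, 10), (50, 50, 60, 60), (8, 8, 20, 20)]
print(row_groups_to_read(groups, (5, 5, 12, 12)))  # [0, 2]
```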

@paleolimbot
Member

I'm happy to see the fast evolution of GeoParquet specs. I don't think the addition of geometry type aims to replace or deprecate something from GeoParquet.

I agree! I think first-class geometry support is great and I'm happy to help wherever I can. I see GeoParquet as a way for existing spatial libraries to leverage Parquet and is not well-suited to Parquet-native things like Iceberg (although others working on GeoParquet may have a different view).

Extension mechanisms are nice because they allow an external community to hash out the discipline-specific details where these evolve at an orthogonal rate to that of the format (e.g., GeoParquet), which generally results in buy-in. I'm not familiar with the speed at which the changes proposed here can evolve (or how long it generally takes readers to implement them), but if @pitrou's suggestion of encoding the type information or statistics in serialized form makes it easier for this to evolve it could provide some of that benefit.

Spatial index such as R-tree may not be suitable for Parquet

I also agree here (but it did come up a lot of times in the discussions around GeoParquet). I think developers of Parquet-native workflows are well aware that there are better formats for random access.

@paleolimbot
Member

I think we really need your advice here. If you were to rethink the design of GeoParquet, how could it do better if the Parquet format had some geospatial knowledge?

I opened up opengeospatial/geoparquet#222 to collect some thoughts on this...we discussed it at our community call and I think we mostly just never considered that the Parquet standard would be interested in supporting a first-class data type. I've put my thoughts there but I'll let others add their own opinions.

@jorisvandenbossche
Member

Just to ensure my understanding is correct:

  • This is proposing to add a new logical type annotating the BYTE_ARRAY physical type. For readers that expect just such a BYTE_ARRAY column (e.g. existing GeoParquet implementations), is that compatible if the column starts having a logical type as well? (Although I assume this might depend on how the specific Parquet reader implementation deals with an unknown logical type, i.e. error out or automatically fall back to the physical type.)
  • For such "legacy" readers (just reading the WKB values from a binary column), the only thing that actually changes (apart from the logical type annotation) is the values of the statistics? Now, I assume that right now no GeoParquet reader is using the statistics of the binary column, because the physical statistics for BYTE_ARRAY ("unsigned byte-wise comparison") are essentially useless when those binary blobs represent WKB geometries. So again, that should probably not cause any compatibility issues?
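The second bullet can be demonstrated concretely: the unsigned byte-wise ordering of WKB blobs bears no relation to spatial ordering. A sketch, hand-rolling WKB for two points:

```python
import struct

def wkb_point(x, y):
    """Little-endian WKB for POINT(x y): byte order, geometry type 1, coords."""
    return struct.pack("<BIdd", 1, 1, x, y)

a = wkb_point(-1.0, 0.0)
b = wkb_point(1.0, 0.0)

# Unsigned byte-wise comparison (Parquet's BYTE_ARRAY sort order) says a > b,
# even though -1.0 < 1.0: the IEEE-754 sign bit lands in the last byte of a
# little-endian double, so byte order bears no relation to coordinate order.
print(a > b)  # True
```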

@jorisvandenbossche
Member

although I assume this might depend on how the specific parquet reader implementation deals with an unknown logical type, i.e. error about that or automatically fall back to the physical type

To answer this part myself: at least for the Parquet C++ implementation, it seems an error is raised for unknown logical types, and it doesn't fall back to the physical type. So that does complicate the compatibility story.

@wgtmac
Member Author

wgtmac commented May 21, 2024

@jorisvandenbossche I think your concern makes sense. It should be a bug if parquet-cpp fails due to an unknown logical type, and we need to fix that. I also have concerns about a new ColumnOrder and need to do some testing. Adding a new logical type should not break anything for legacy readers.

@jornfranke

jornfranke commented May 21, 2024

Apache Iceberg is adding geospatial support: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI. It would be good if Apache Parquet could support a geometry type natively.

No one has really worked on the geo integration into Iceberg for some time: apache/iceberg#2586

@szehon-ho

No one has really worked on the geo integration into Iceberg for some time: apache/iceberg#2586

Yes, there is now a concrete proposal (apache/iceberg#10260), and the plan currently is to bring it up in the next community sync.

@cholmes

cholmes commented May 23, 2024

Thanks for doing this @wgtmac - it's awesome to see this proposal! I helped initiate GeoParquet, and hope we can fully support your effort.

@paleolimbot I'm happy to see the fast evolution of GeoParquet specs. I don't think the addition of geometry type aims to replace or deprecate something from GeoParquet. Instead, GeoParquet can simply ignore the new type as of now, or leverage the built-in bbox if beneficial.

That makes sense, but I think we're also happy to have GeoParquet replaced! As long as it can 'scale up' to meet all the crazy things that hard core geospatial people need, while also being accessible to everyone else. If Parquet had geospatial types from the start we wouldn't have started GeoParquet. We spent a lot of time and effort trying to get the right balance between making it easy to implement for those who don't care about the complexity of geospatial (edges, coordinate reference systems, epochs, winding), while also having the right options to handle it for those who do. My hope has been that the decisions we made there will make it easier to add geospatial support to any new format - like that a 'geo-ORC' could use the same fields and options that we added.

For additional (informative) attributes of the geometry type, if some of them are stable and make sense to store them natively into parquet column metadata, then perhaps we can work together to make it happen? I think the main goal of this addition is to enhance interoperability of geospatial data across systems and at the same time it takes little effort to convert to GeoParquet.

Sounds great! Happy to have GeoParquet be a place to 'try out' things. But I think ideally the surface area of 'GeoParquet' would be very minimal or not even exist, and that Parquet would just be the ideal format to store geospatial data in. And I think if we can align well between this proposal and GeoParquet that should be possible.

/**
 * Coordinate Reference System, i.e. mapping of how coordinates refer to
 * precise locations on earth, e.g. OGC:CRS84
 */
3: optional string crs;
Member

@paleolimbot @wgtmac @Kontinuation @zhangfengcdt Please see the latest comment from @desruisseaux on the Iceberg Geometry PR (apache/iceberg#10260). I also discussed this issue with the GeoParquet community today.

In short, some people like PROJJSON given its popularity and completeness; some people don't like it because it is still incomplete and there is another ongoing effort at the OGC CRS working group, called CRSJSON, to replace it; some other people prefer SRID (comments from Snowflake).

We should define which CRS encoding is used in this optional CRS string, via an additional enum field, to avoid future confusion.

Example:

enum CRSEncoding {
  WKT2 = 0;
  SRID = 1;  // Format: AUTHORITY:CODE
  PROJJSON = 2;
  // Future work when CRSJSON is completed: add CRSJSON = 3
}

5: optional CRSEncoding crs_encoding;

Some members of the GeoParquet community think allowing multiple encodings of the CRS introduces additional overhead for implementers. I would argue that this is not a problem because:

  1. The reader / writer does not need to support all CRS Encodings. It is optional anyway.
  2. Even if we use PROJJSON, a large number of engines in the Java world (Hive, Trino, Presto, Sedona/Spark, Sedona/Flink) cannot comprehend it, because there is no Java library for it.
  3. The CRS encoding is purely a descriptive field. If an engine does not understand it, it can simply carry it along.

In addition, the next GeoParquet community sync is August 12. Please join us for the discussion of this topic: Zoom meeting: opengeospatial/geoparquet#240

Member

Example:

I know this is just an example, but can we make this a string to avoid a Thrift update when a new CRS encoding arrives?

WKT2 = 0;

If this is included as an option it would need to be more explicit about what kind of WKT2 we're talking about (I think we'd mean WKT2 2019): https://github.com/OSGeo/PROJ/blob/79b4f28c10d1695da841ca33d6f14fced2a2979a/src/proj.h#L793

introduce additional overhead to the implementer

It is true that the implementer typically only handles translating the coordinates into some native library representation (e.g., JTS, GEOS) or performing computation based on them. I think the main thing here is to make it explicit exactly how to resolve the projection parameters given a CRS representation...with WKT and PROJJSON they are embedded (i.e., an implementor of a coordinate transform does not have to resolve anything from a database to translate between another CRS on the same datum...like long/lat to a mercator projection). With SRID, one would have to make it clear how to actually resolve those (EPSG or PROJ database version, URI of a lookup table of some kind, etc.).

Member Author

+1 for using string type if the encoding is just informative.


Looks like strong opinions on all fronts. +1 for adding the crs field and for it to be a string; in my view, the storage should not be too opinionated. Maybe we can document common values in the docs?

@jiayuasu jiayuasu (Member) commented Aug 1, 2024

@wgtmac According to the feedback from the Iceberg thread, I suggest we also add a string field named crs_kind in addition to the crs field. The only allowed value currently is PROJJSON. In the future, if there is a new OGC standard called CRSJSON that differs from PROJJSON, we will allow another value, CRSJSON.

For WKT2 2019 <-> PROJJSON, we will implement a Java version of this library (https://github.com/rouault/projjson_to_wkt) so whoever wants a WKT2 2019 CRS can use it to derive one from the PROJJSON string.

Member

Looks like strong opinions on all fronts.

Definitely 🙂

I suggest we also add a string field namely crs_kind in addition to the crs field.

I think this is a great idea. The scope of this PR should be adding the Geometry type in a way that lets the geospatial community have these discussions without forcing Thrift changes and/or changes in Parquet implementations themselves. We can continue to debate the allowed values of crs_kind (and I imagine we will for some time as we accumulate use cases).

Isn't it a bit weird that the only allowed value is the non-standard encoding?

There are a few threads in the GeoParquet repo where it was discussed...I think the idea was that it is structurally identical to WKT2 2019 but can be inspected with access to a JSON parser (which exists almost everywhere). This would allow (for example) a library implementing a computation to error for a Geographic CRS if it didn't apply, or to extract the authority and code without a WKT parser (WKT does not exist outside the CRS world as far as I know). This is a very good fit for something like (Geo)Parquet, where we are trying to ensure that those who care can express complex geospatial concepts without forcing Parquet implementations or related code to link to geo-specific libraries.

This is not needed as latest WKT 2 is backward compatible with the previous version

In the case that it is allowed, it is probably a good idea to communicate that with a reference to both standards 🙂


@jiayuasu quick question to try to follow: what do we put for crs_kind when crs="OGC:CRS84"? Is it empty?

@jiayuasu jiayuasu (Member) commented Aug 1, 2024

@szehon-ho According to the GeoParquet spec (https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md), when I say hard-coded to OGC:CRS84, the crs field actually stores this JSON value:

{
    "$schema": "https://proj.org/schemas/v0.5/projjson.schema.json",
    "type": "GeographicCRS",
    "name": "WGS 84 longitude-latitude",
    "datum": {
        "type": "GeodeticReferenceFrame",
        "name": "World Geodetic System 1984",
        "ellipsoid": {
            "name": "WGS 84",
            "semi_major_axis": 6378137,
            "inverse_flattening": 298.257223563
        }
    },
    "coordinate_system": {
        "subtype": "ellipsoidal",
        "axis": [
        {
            "name": "Geodetic longitude",
            "abbreviation": "Lon",
            "direction": "east",
            "unit": "degree"
        },
        {
            "name": "Geodetic latitude",
            "abbreviation": "Lat",
            "direction": "north",
            "unit": "degree"
        }
        ]
    },
    "id": {
        "authority": "OGC",
        "code": "CRS84"
    }
}

And the crs_kind field is PROJJSON.

Both the crs and crs_kind fields are optional, but they need to be present together if one wants to store CRS info.


what do we put for crs_kind when crs="OGC:CRS84"? Is it empty?

+1 to all @jiayuasu said, and just to be totally clear: it would often be empty. Including the CRS and kind/encoding in this case is more 'informative'. Implementations should understand that if they see crs="OGC:CRS84" then they don't need to check the crs and kind/encoding, and if the values differ then they should use CRS84 and ignore the provided CRS. We should provide the definition of OGC:CRS84 in all possible encodings (WKT1, WKT2, PROJJSON, etc.) via a link from the core definition, so that any projection-aware library is sure to get exactly the right definition. The goal for GeoParquet was to let implementations that only want to support long/lat do so without having to parse or worry about anything else, since it's a lot of complexity and requires some sort of geo library to parse.
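The defaulting behaviour described above can be sketched as follows (a hypothetical reader-side helper; the field names and error handling are assumptions, not part of the proposal):

```python
# Sketch of the reader-side defaulting described above: an absent crs (or
# the shorthand "OGC:CRS84") means longitude/latitude WGS 84, and crs_kind
# only matters when a full CRS payload is present. Hypothetical helper.

DEFAULT_CRS = "OGC:CRS84"  # longitude/latitude, WGS 84

def resolve_crs(crs=None, crs_kind=None):
    """Return the (crs, kind) pair a geometry-aware reader should act on."""
    if crs is None or crs == DEFAULT_CRS:
        # No parsing needed: treat the column as lon/lat WGS 84 and
        # ignore any kind/encoding that may have been provided.
        return DEFAULT_CRS, None
    if crs_kind is None:
        raise ValueError("crs_kind must accompany a non-default crs payload")
    return crs, crs_kind

print(resolve_crs())  # ('OGC:CRS84', None)
print(resolve_crs(crs='{"id": {"authority": "OGC", "code": "CRS84"}}',
                  crs_kind="PROJJSON"))
```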

Member Author

I have added the above example to the crs and crs_encoding fields. Please check @jiayuasu @cholmes

@desruisseaux

Also, I think the documentation needs to be expanded:

  • "precise locations on earth" depends on the CRS. When using OGC:CRS84 or EPSG:4326, the location precision is approximately 2 meters.
  • When using SRID, axis order shall be as defined by the authority. This means that EPSG:4326 is (latitude, longitude). If a different order is desired, this is perfectly fine but it shall not be called EPSG:4326.

@paleolimbot
Member

When using SRID, axis order shall be as defined by the authority

In GeoParquet (and GeoPackage), there is language that makes it explicit that the first axis is always an easting or longitude even in the case where the authority does not define it in this way.

https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#coordinate-axis-order

https://www.geopackage.org/spec130/#gpb_spec

This is not to invalidate the importance of respecting the axis order in geospatial-specific formats; however, here its use would break quite a lot of existing uses of "srid" (should we choose to allow that here).

@desruisseaux

In GeoParquet (and GeoPackage), there is language that makes it explicit that the first axis is always an easting or longitude even in the case where the authority does not define it in this way.

GeoPackage itself replicated the GeoTIFF precedent. Those two standards were originally developed outside OGC, and we have difficulties reconciling the community practice with being unambiguous. A language mandating (easting, northing) order does not work at the North pole, where all directions are oriented toward the south. It is ambiguous with map projections having axes such as (west, north) or (east, south): should we also reverse the sign of coordinate values? And it does not work at all with CRS having aft, port, starboard, clockwise, counter-clockwise, toward, and away-from directions, such as vehicles (all those directions are part of the ISO 19111 standard).

The ISO 19107:2019 — Spatial schema standard (the abstract model for geometries, of which Simple Features is a simplified subset) addresses this problem with a Permutation class which contains an outOrder field. Can we do something along the lines of that standard in GeoParquet? When SRID EPSG:4326 is used, a "Permutation" field should contain the {1, 0} array of integers (or {2, 1} if we start counting at 1) to explicitly request a change of axis order from (latitude, longitude) to (longitude, latitude).

Reference: ISO 19107:2019 §6.2.8.6
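The outOrder idea can be illustrated with a small sketch (illustrative only; ISO 19107 defines this formally, and the function name here is an assumption):

```python
# Sketch of an ISO 19107-style axis permutation: reorder each coordinate
# tuple according to a 0-based outOrder array. Illustrative helper only.

def apply_permutation(coords, out_order):
    """Reorder each coordinate tuple according to out_order (0-based)."""
    return [tuple(c[i] for i in out_order) for c in coords]

# EPSG:4326 stores (latitude, longitude); outOrder {1, 0} yields (lon, lat).
print(apply_permutation([(48.85, 2.35)], (1, 0)))  # [(2.35, 48.85)]
```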

@paleolimbot
Member

This is a great discussion for the GeoParquet repo (where more subject experts can weigh in): https://github.com/opengeospatial/geoparquet/issues . In general (with GeoParquet and its evolution here), we are seeking to standardize how existing stakeholders are storing geospatial data in Parquet files. In the absence of real-world examples where this is happening already, I would suggest that Permutation be a future consideration where desired by stakeholders (but the GeoParquet repo is the right place for the discussion!)

@desruisseaux

In the absence of real-world examples where this is happening already (…snip…)

For what it is worth, the use of CRS beyond (east, north) axes has been actively explored in OGC Testbeds for a few years: the Non-Terrestrial Geospatial engineering report (ER), the Extraterrestrial GeoTIFF ER, the 3D+ Data Space Object ER, and more. There is really a demand for that (e.g. for referencing images taken by drones), and the apparent lack of examples is because people don't know how to do that with current software. This is a chicken-and-egg problem and is why those OGC Testbeds exist.

However, I agree that it should be a discussion for GeoParquet. I mentioned those points because I thought that the above proposed CRS string was specific to this Parquet project.

@wgtmac
Member Author

wgtmac commented Aug 6, 2024

Tried to catch up with the latest discussion. To incorporate the new covering and crs_encoding, I'd propose the following changes:

  1. crs_encoding has been added to GeometryType. (Do we need to make both crs and crs_encoding required?)
struct GeometryType {
  ...
  
  /**
   * Coordinate Reference System, i.e. mapping of how coordinates refer to
   * precise locations on earth.
   */
  3: optional string crs;
  /**
   * Encoding used in the above crs field.
   * Currently the only allowed value is "PROJJSON".
   */
  4: optional string crs_encoding;
  
  ...
}
  2. New Covering has been adopted and uses the binary type for the value to be more flexible.
struct Covering {
  /** 
   * A type of covering. Currently accepted values: "WKB".
   */
  1: required string kind;
  /** A payload specific to kind:
   * - WKB: well-known binary of a POLYGON that completely covers the contents.
   *   This will be interpreted according to the same CRS and edges defined by
   *   the logical type.
   */
  2: required binary value;
}
  3. GeometryStatistics uses a list of Covering to support more than one covering.
struct GeometryStatistics {
  ...

  /** A list of coverings of geometries */
  2: optional list<Covering> coverings;

  ...
}

@jiayuasu @paleolimbot Please let me know if these changes are appropriate. I will update the PR if it reaches consensus.

EDIT:

  • rename crs_kind to crs_encoding as suggested by @desruisseaux
  • Covering/kind now only supports WKB.

@desruisseaux

Note: crs_kind is maybe not a good name. "Kind" sounds close to "type" to me, so CRS kind would be "Projected CRS", "Geographic CRS", etc., and those kinds are the same in all encodings (JSON, WKT, etc.). I think that crs_encoding describes better the intent of this field.

@jiayuasu
Member

jiayuasu commented Aug 6, 2024

@wgtmac

  1. We should remove S2 / H3 from the covering column because it is still unclear how to create statistics for them.
  2. We should specify the ordering of coordinates in the bbox covering. It should follow the same definition as in GeoParquet: xmin, ymin, xmax, ymax.

@paleolimbot
Member

Thank you for summarizing that! Along with what @jiayuasu mentioned, it should be clear how this is going to get encoded as binary (JSON string? Little-endian doubles with specific ordering?).

I think we don't need both WKT and WKB (just WKB would be my preference!)

@wgtmac
Member Author

wgtmac commented Aug 6, 2024

We should remove S2 / H3 from the covering column because it is still unclear about how to create statistics for them.

Makes sense. Let me remove it.

We should specify the ordering of coordinates in the bbox covering.

Did you mean removing the struct BoundingBox and using the struct Covering to represent the bbox (https://github.com/apache/parquet-format/pull/240/files#diff-834c5a8d91719350b20995ad99d1cb6d8d68332b9ac35694f40e375bdb2d3e7cR268)? In this way, the WKB encoding will handle the coordinates, right? @jiayuasu

it should be clear how this is going to get encoded as binary (JSON string? Little-endian doubles with specific ordering?).

IIUC, the Covering/kind field has already specified the way to encode the data. For example, WKB for bbox. @paleolimbot

I think we don't need both WKT and WKB (just WKB would be my preference!)

Yes, WKB is always preferred in this case.

@jiayuasu
Member

jiayuasu commented Aug 6, 2024

@wgtmac

We should specify the ordering of coordinates in the bbox covering.

I retracted what I said. The geometry used in the covering is not necessarily a bbox. In fact, it is a polygon.

For the WKB encoding of the geometry in the covering statistics and the WKB encoding in each geometry field, we should be clear about which WKB variant is used. See the differences between the WKB variants.

Maybe we should use Standard WKB (no SRID, Z, M value) or ISO WKB (with Z, M value but no SRID). Sedona has been using Big Endian in its WKB encoding. Do you think Big Endian makes sense? @paleolimbot

@paleolimbot
Member

IIUC, the Covering/kind field has already specified the way to encode the data. For example, WKB for bbox

I am a little confused about the bbox in the covering here...a WKB polygon is not sufficient to encode a bbox in more than two dimensions (although the kind/value proposal is sufficiently flexible that we could evolve that without a thrift change should the use case emerge).

Maybe we should use Standard WKB (no SRID, Z, M value) or ISO WKB (with Z, M value but no SRID)

GeoParquet specifies ISO WKB (as a SHOULD) and I think this is what should be used everywhere in this spec.

Do you think Big Endian make sense?

I always use little endian because big endian forces looping + swapping endian for each 8 bytes of every ordinate on virtually all modern hardware (unless I am mistaken!). (Neither here nor there for the spec, though, since WKB has the endianness specified in the first byte).
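Since the first byte of every WKB value declares its byte order, a reader can handle both NDR and XDR with no extra metadata. A minimal sketch in plain Python (illustrative only, not tied to any Parquet implementation):

```python
import struct

def parse_wkb_point(wkb: bytes):
    """Decode a WKB POINT, honoring the byte-order marker in byte 0.

    Byte 0: 1 = NDR (little-endian), 0 = XDR (big-endian).
    Bytes 1-4: geometry type code; bytes 5-20: x and y doubles.
    """
    order = "<" if wkb[0] == 1 else ">"
    (geom_type,) = struct.unpack_from(order + "I", wkb, 1)
    if geom_type != 1:  # 1 = Point in (ISO) WKB
        raise ValueError(f"not a POINT: type code {geom_type}")
    x, y = struct.unpack_from(order + "dd", wkb, 5)
    return x, y

# The same POINT(30 10) in both byte orders decodes identically.
ndr = b"\x01" + struct.pack("<I", 1) + struct.pack("<dd", 30.0, 10.0)
xdr = b"\x00" + struct.pack(">I", 1) + struct.pack(">dd", 30.0, 10.0)
print(parse_wkb_point(ndr), parse_wkb_point(xdr))  # (30.0, 10.0) (30.0, 10.0)
```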

@jiayuasu
Member

jiayuasu commented Aug 7, 2024

@paleolimbot Sounds good. I think the proposal has addressed the WKB issue.

On the other hand, my understanding is that the WKB geometry used in the covering statistics is not a bbox, and it will be interpreted according to the same CRS and edges defined by the logical type.

@paleolimbot
Member

On the other hand, my understanding is that the WKB Geometry used in the covering statistics is not a bbox.

It's not necessarily one; however, the cheapest possible way to build a bbox covering would be to build a bbox from the input and convert it to a POLYGON. Similarly, the cheapest possible way to push a filter down into the WKB-encoded covering would be to compute the bbox of the WKB-encoded covering and use that. In the Java implementation it seems like you can lean on JTS for all of this (in C++ we probably don't want to invoke any actual geometry libraries).
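The "cheapest possible way" described above, computing a bbox from the input and converting it to a POLYGON, needs no geometry library at all. A sketch assuming little-endian (NDR) encoding for illustration:

```python
import struct

def bbox_to_wkb_polygon(xmin, ymin, xmax, ymax):
    """Encode a 2D bbox as a little-endian WKB POLYGON with one closed ring."""
    ring = [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax), (xmin, ymin)]
    out = b"\x01"                        # NDR (little-endian) byte-order marker
    out += struct.pack("<I", 3)          # geometry type 3 = Polygon
    out += struct.pack("<I", 1)          # one ring
    out += struct.pack("<I", len(ring))  # five points (ring is closed)
    for x, y in ring:
        out += struct.pack("<dd", x, y)
    return out

wkb = bbox_to_wkb_polygon(0.0, 0.0, 2.0, 1.0)
print(len(wkb))  # 1 + 4 + 4 + 4 + 5 * 16 = 93 bytes
```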

@jiayuasu
Member

jiayuasu commented Aug 7, 2024

@paleolimbot yes, I agree. So is it better to add in the spec that the covering column is currently only safe for pruning on 2-dimensional geometries? Or is it better to actually use a bbox in the covering instead of a WKB polygon? Either way, is it redundant to even have these covering statistics given that we already have the bbox statistics separately?

@wgtmac
Member Author

wgtmac commented Aug 7, 2024

The covering stats were introduced to be vendor-agnostic (e.g. S2/H3) and to allow a covering polygon when a bbox is unavailable. If we can replace the bbox stats by adding a new Covering/kind for bbox, I'm inclined to remove the explicit bbox stats, though we would have to design a binary encoding for bbox.

@paleolimbot
Member

I like the latest change! (i.e., keep the bbox in thrift, reduce the future coverings options to just WKB for now). The bounding box will always be applicable and it makes sense to keep it in thrift.

/**
* Allowed for physical type: BYTE_ARRAY.
*
* Well-known binary (WKB) representations of geometries. It supports 2D or
Member

@jorisvandenbossche commented Aug 12, 2024


Shall we specify here that this should be ISO WKB (and not "extended WKB", using the terminology of https://libgeos.org/specifications/wkb/, which only matters for >2D geometries)?

The text in the geoparquet spec about this: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#wkb
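For >2D geometries the two variants diverge in how the type code carries the Z flag: ISO WKB adds 1000 to the base code, while extended (PostGIS-style) WKB sets a high bit. A simplified sketch (handles only 2D and Z, per the conventions documented at libgeos.org):

```python
# Illustrative only: type-code conventions for ISO WKB vs. extended WKB.
ISO_POINT_Z = 1001              # ISO WKB: base code + 1000 for Z (+2000 M, +3000 ZM)
EWKB_POINT_Z = 1 | 0x80000000   # extended WKB: high-bit Z flag

def decode_type(code: int):
    """Return (base_type, has_z) for either convention (2D and Z only)."""
    if code & 0x80000000:        # extended-WKB Z flag
        return code & 0x0FFFFFFF, True
    if 1000 <= code < 2000:      # ISO WKB Z range
        return code - 1000, True
    return code, False

print(decode_type(ISO_POINT_Z))   # (1, True)
print(decode_type(EWKB_POINT_Z))  # (1, True)
```

Both conventions decode to the same base type, which is why a spec that pins one variant (ISO, as GeoParquet does) keeps readers simple.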

Member

This was mentioned in this spec:

> This is borrowed from `geometry_types` column metadata of GeoParquet [1], except that values in the list are WKB (ISO variant) integer codes [2]. Table below shows the most common geometry types and their codes:
@jorisvandenbossche
Copy link
Member

Another quick note: we should mention something about the coordinate axis order being x/y (lon/lat, easting/northing). The text from geoparquet is here: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#coordinate-axis-order

* This encoding enables GeometryStatistics to be set in the column chunk
* and page index.
*/
WKB = 0;


Should we specify little-endian or big-endian? The WKB specification terminology describes these as NDR (little-endian) and XDR (big-endian). I'm trying to come up with the Iceberg spec and realized we didn't specify this here. I'm thinking NDR, unless we want to make it configurable.


cc @jiayuasu @paleolimbot sorry if already covered in existing discussion that I missed.

Member

@szehon-ho This was discussed above. WKB has the endianness specified in the first byte, so both are allowed and readers will be able to identify and read it. We don't need to specify that ourselves in the Parquet and Iceberg specs.
