Make the Metadata Optional? #169
Replies: 7 comments 18 replies
-
I really like the idea of having sensible defaults and avoiding the "geo" file metadata altogether. There is interest in supporting different geometry encodings, and there may be concern about "locking in" WKB as the default binary encoding. I know you may be holding off on suggesting this here, but one option would be to say that if the geometry column is a string type, then the default encoding is WKT. Perhaps the initial version of the spec would not describe support for a binary encoding. That could leave more time to discover what might make for an optimal binary encoding, and a future version of the spec could describe the default in cases where the geometry column is binary without breaking compatibility.
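To make that concrete, here is a rough sketch of how a reader could pick a default encoding purely from the column's physical type. pyarrow and a conventionally named "geometry" column are assumptions for illustration, not anything the spec currently mandates:

```python
# Sketch only: pick a default geometry encoding from the column's physical
# type when no "geo" metadata is present. pyarrow and the column name
# "geometry" are assumptions here, not something the spec requires.
import pyarrow.parquet as pq
import pyarrow.types as pat

def default_encoding(path, column="geometry"):
    field_type = pq.read_schema(path).field(column).type
    if pat.is_string(field_type) or pat.is_large_string(field_type):
        return "WKT"  # string column -> assume WKT under this proposal
    if pat.is_binary(field_type) or pat.is_large_binary(field_type):
        return "WKB"  # binary column -> assume WKB
    raise ValueError(f"unexpected type for column {column!r}: {field_type}")
```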
-
Provided I'm thinking of this correctly, this conflates having sensible defaults when writing with a lack of support for reading metadata, which may ultimately harm the goals here. It is one thing to allow reading a GeoParquet file that has no metadata, in which case sensible defaults make sense. We tried to achieve some degree of consensus around sensible defaults and a relative minimum of absolutely required fields, and we can certainly try to figure out how to reduce those further while preserving the overall goals. It is another thing entirely to allow reading a GeoParquet file that has metadata when reading that metadata is not supported by a particular engine. That means that as soon as someone creates a file that does not follow the defaults, we risk misinterpretation by the reader, potentially with bad results (e.g., a projected CRS). We elaborated in other issues on use cases where it is important not to force reprojection to the default CRS, as that harms internal workflows (which may span separate engines, e.g., Python and R), even if we have guidelines that encourage (but do not require) using the default CRS when sharing publicly. Meaning: there are, and will continue to be, use cases where we need the metadata in place to read these files correctly.

I am of course assuming that if we're talking about supporting writes by engines that do not support Parquet metadata, we're also talking about supporting reads by those same engines. Maybe that's not a thing? Some of our earlier thinking was that GeoParquet is a portable file format between different engines that have their own internal geometry support, but maybe now we're also talking about use cases where incoming geometry data is simply stored and later retrieved (like other attributes) and not used directly; i.e., I/O with engines that do not themselves deal in geometries. The idea of producers that are not themselves consumers is an interesting one for us to dig into. It would be helpful to know more about this use case, especially around the read side of things.

Could we be strict and say that all readers must be able to read GeoParquet metadata at least so far as to determine that no metadata is present (in which case, assume the defaults)? But if GeoParquet metadata is marketed more strongly as optional, there is always the risk that readers simply won't implement it either. My big hangup is the distinction between a file that has metadata which goes unused because you don't know it is there, and a file that has no metadata, in which case the assumption is that everything can be read at the default values. If the convention is to enable use without reading metadata, then all you can support are the defaults. Full stop.

Geometry encodings are a separate concept to discuss; here the focus should be on whether there is a sensible default that can be assumed without storing it in metadata. We wanted WKB to be required in order to allow specifying GeoArrow (or something else) at a later date, but we could instead make the encoding optional with a default value of WKB, with the rule that if GeoArrow is used it must be set in the metadata and the reader / writer must support that metadata. Also, based on the implementations so far, I'd suggest that the default be WKB.

Briefly: WKT seems counter to the goals here, which are a portable, as-high-performance-as-possible encoding within a high-performance, binary, column-oriented container. WKT is bigger to store and slower to parse. Human readability should not be a goal for this encoding for storage / transmission. For humans, this is a display problem: your engine needs to decode WKB and show WKT when previewing query results or the like, which is what it would need to do anyway if it had an internal representation for geometry objects. I.e., it is the same problem as storing image bytes in a column, right? In a separate issue, so that we can discuss it specifically, I'd be interested to hear more about the use cases where WKT might make sense and not run counter to the goals around speed and size.

Using the data type of an inferred geometry column (e.g., "geometry") to infer its encoding without reading metadata, e.g., WKB vs. WKT, seems like a slippery slope. It isn't too many steps beyond that to say that if it is a string and the first character is "{", assume it is a GeoJSON geometry object. And then we have a mess...
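As a rough sanity check on the size point above, compare the two encodings for a moderately long linestring. shapely is used purely for illustration here; any geometry library would do:

```python
# Rough illustration of the WKB-vs-WKT size point above, using shapely
# (an assumption for the sketch; any geometry library would do).
from shapely.geometry import LineString

# A 1000-vertex linestring with lon/lat-like coordinates.
line = LineString([(-122.0 + i * 0.000123, 47.0 + i * 0.000456)
                   for i in range(1000)])

wkb_size = len(line.wkb)           # header plus two 8-byte doubles per vertex
wkt_size = len(line.wkt.encode())  # every coordinate spelled out as text

print(wkb_size, wkt_size)
# WKB is about 16 KB here; the WKT text is noticeably larger and has to be
# re-parsed back into floats on read.
```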
-
Just some thoughts... As far as I understand, the aim is to use sensible defaults so that no metadata needs to be added. So I assume that once there's a single field for which we can't find a sensible default, the main purpose of the request is not fulfilled, right? Looking at the fields, the optional ones I assume have sensible defaults, so the focus should likely be on the required fields.
The most difficult one seems to be the version number. The other question is: is this really the right approach? It feels like this may end up in a situation where files with no metadata are generated but often don't follow the defaults, which means you may end up with clients that need to "detect" the right values. The alternative approach is to give users simple tools to "convert" their exports from Presto/Trino/Athena into proper GeoParquet: a simple CLI tool where you just pass in the exported files and maybe one or two options, and it adds the metadata. I could imagine that this may be the better approach, at least from a specification perspective.
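Such a converter could be little more than the following sketch. pyarrow and a WKB column named "geometry" are assumptions; the keys mirror the published "geo" schema, but treat the version string and defaults here as placeholders rather than anything normative:

```python
# Sketch of a minimal "attach the metadata after the fact" tool, assuming
# pyarrow and a WKB column named "geometry". The exact required keys and
# version string should be taken from the spec, not from this example.
import json
import pyarrow.parquet as pq

def add_geo_metadata(src, dst, column="geometry", crs=None):
    table = pq.read_table(src)
    geo = {
        "version": "1.0.0",            # assumption: current spec version
        "primary_column": column,
        "columns": {column: {"encoding": "WKB", "geometry_types": []}},
    }
    if crs is not None:
        geo["columns"][column]["crs"] = crs   # e.g. a PROJJSON object
    metadata = dict(table.schema.metadata or {})
    metadata[b"geo"] = json.dumps(geo).encode("utf-8")
    pq.write_table(table.replace_schema_metadata(metadata), dst)

# e.g. add_geo_metadata("athena_export.parquet", "compliant.parquet")
```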
-
To me there are two competing use cases:
I think this discussion is tailored primarily towards the first case? I'm wondering if that might be solved by having GeoParquet client APIs that allow a user to pass in metadata that they know to be true.
On the latter, I think it's really hard to handle source data without metadata. Taking GeoJSON as an example, it's not trivial (at least without parsing the content into a data structure) to check whether an input file is actually GeoJSON. And on the versioning front, given that GeoJSON doesn't carry a version, it's impossible for a reader to know which version of GeoJSON is at hand. This isn't entirely abstract, either; GDAL still writes the original GeoJSON spec by default! (As described here, you have to pass an explicit option to opt in to the newer spec.)
In any case, my hunch is that sniffing an input Parquet file to see if it's GeoParquet would be neither easy nor reliable.
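To illustrate how heuristic that sniffing would get, here is the kind of guesswork a reader would be reduced to with no metadata at all. pyarrow and shapely are assumed for the sketch:

```python
# Sketch of what "sniffing" a metadata-less file would involve (pyarrow and
# shapely are assumptions). Note how quickly this becomes heuristic: a binary
# value that happens to parse as WKB, or a string starting with "{", proves
# very little about the rest of the file.
import pyarrow.parquet as pq
from shapely import wkb

def looks_like_geoparquet(path, column="geometry"):
    table = pq.read_table(path, columns=[column])
    sample = table.column(column)[0].as_py()
    if isinstance(sample, bytes):
        try:
            wkb.loads(sample)   # parses, but so might unrelated binary blobs
            return True
        except Exception:
            return False
    if isinstance(sample, str):
        return sample.lstrip().startswith(("POINT", "LINESTRING", "POLYGON", "{"))
    return False
```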
-
There are some very valid assumptions showing up here that we've found don't hold for users of what are effectively managed tabular query engines (i.e., Presto/Trino/Athena). Rather than manifesting as libraries/tools that can be upgraded at will by end users with geospatial needs, these managed tabular query engines are intended to be much more general-purpose and sit atop an extensive pile of dependencies. "Files" don't exist; instead, they manifest as "tables" with a limited set of column types (effectively deriving from the list of Hive Data Types, for better or for worse). Parquet is but one serialization of many, with storage/access optimizations visible to the reader/writer dependency and erased by the contract between the storage and query layers (in the name of "predicate push-down"). GeoParquet seems to have been designed with some assumptions that are difficult to square with the above (please correct me if I'm misrepresenting):
In Big Data land (which I started to inhabit when creating ORC files for OSM that could be queried using Amazon Athena), I think the equivalent points are:
I'm not sure how best to reconcile these, or whether that's even possible at this stage. I do know that even without access to the metadata (or with externalized metadata), the consistency that GeoParquet has encouraged has made it easier to work with geospatial data in these types of Big Data query engines. Finally, a question for the group: are "Big Data end-users" a target persona for GeoParquet? Personally, I think they should be, but we have some work to do to get there.
-
Thanks for all the great discussion everyone, and @jwass for bringing it up. We'll discuss this (and #170) synchronously at the bi-weekly GeoParquet call. Next one is May 8th, at this time (10 am my time, but use the link to convert it to your time). Email me ( at planet dot com - same as my github user name) if you'd like to join - all are welcome.
-
There were a couple of scenarios that came to mind in today's discussion around the ideas of GeoParquet-compatible vs. truly GeoParquet-compliant. If there is just one step between a data producer and a GeoParquet-compatible engine that can produce valid GeoParquet, and anything that is non-default is known and can be passed to that engine, then there is not too much chance of data misinterpretation:
In this case, you can produce a Parquet file that has just enough information that a GeoParquet-compatible reader could read it based on the defaults (after first checking that there is no GeoParquet metadata) and either use that data internally or write out a compliant GeoParquet file. So we'd be expanding the ecosystem that could produce data that could then be consumed as GeoParquet-compatible files. Great! But there's perhaps a more pernicious case that might lead to a much greater risk of data misinterpretation / data loss: involving GeoParquet-compatible (but not compliant) or even simply Parquet-compatible intermediate engines to do some non-geospatial filtering or other transformation step.
Now, if the original non-default information (e.g., an arbitrary planar CRS) stored in the GeoParquet file at the beginning of this chain is known at the end, no problem; you just reattach it by passing parameters to your GeoParquet-compatible reader and can produce a valid GeoParquet file that still has all the information. BUT if you no longer have that information, the intermediate step has discarded information, and if your final step assumes the defaults when reading the GeoParquet-compatible data, you've now misinterpreted it. To be clear: the intermediate is not claiming to support GeoParquet. What is tricky is that the intermediate step was performed by a non-geospatial engine; it shouldn't have to know what to do with geometry data (i.e., it doesn't decode / encode geometries), so we perhaps can't insist that it be a GeoParquet-compliant engine; it can just write a subset of the records containing WKB plus other attributes out to a Parquet file. From that transformation step onward, the existence of the GeoParquet metadata was not even known or knowable without the original. You don't know what you don't know. This is a problem if the data producer at the beginning of this chain is not the data consumer at the end of it (if you are, you know all the things). The point I'm trying to make is that this is a different situation yet again from the minimum requirement we set for GeoParquet-compliant readers (i.e., they have to at least check for the presence of metadata, have some ability to interpret it according to the spec, and be able to assume defaults or take parameters to fill those in). I'm not sure what we can do in this case other than to recommend not using Parquet-only intermediate steps that start with GeoParquet unless you can capture or control enough of that chain end-to-end.
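For the recoverable case, the "reattach what you know" step might look roughly like the sketch below. geopandas/pyarrow and an out-of-band known CRS are assumptions; without that knowledge there is nothing to reattach and a reader assuming the defaults silently misinterprets the data:

```python
# Sketch of reattaching known, non-default information after a Parquet-only
# intermediate step has stripped the "geo" metadata. Assumes geopandas and
# pyarrow, and that the original CRS (e.g. "EPSG:3857") is known out-of-band.
import geopandas as gpd
import pyarrow.parquet as pq

def restore_geoparquet(plain_parquet, out_path, crs, column="geometry"):
    df = pq.read_table(plain_parquet).to_pandas()
    geometry = gpd.GeoSeries.from_wkb(df[column], crs=crs)
    gdf = gpd.GeoDataFrame(df.drop(columns=[column]), geometry=geometry)
    gdf.to_parquet(out_path)   # writes the "geo" metadata again

# e.g. restore_geoparquet("filtered.parquet", "restored.parquet", "EPSG:3857")
```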
-
Hello - after using some existing Parquet tooling to try to write GeoParquet, one source of friction I (and some colleagues) have encountered is that many tools don't normally expose writing or accessing the Parquet metadata. For example, Presto/Trino/Athena can write Parquet files as the result of a SQL query but don't have a way to specify the metadata. Similarly, other tools I've used that convert tabular data into Parquet or other formats don't expose a path to write metadata out, because they treat Parquet files as just simple tables. I think this would be true for DuckDB too, though I'm not 100% sure.
I'm wondering if the metadata is really needed at all? An analogy is that GeoJSON files don't have any special marking to indicate they're GeoJSON; they just happen to have the right fields/structure. Can GeoParquet do the same? In other words, you could say something like "a Parquet file is a GeoParquet file if it has a column called 'geometry' with WKB-formatted data".
More concretely, I think the spec would change along these lines:
If the spec could support this, there are tons of tools today that could just write GeoParquet without any additional/special spatial awareness.
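A reader following the "check for metadata first, then fall back to defaults" behavior discussed above might look roughly like this. pyarrow/geopandas and the specific default values are assumptions for illustration:

```python
# Sketch of the reader behavior discussed in this thread: honor "geo" metadata
# when present, and only fall back to proposed defaults (a WKB column named
# "geometry", CRS OGC:CRS84) when it is genuinely absent. The libraries and
# default values here are assumptions, not the spec.
import json
import geopandas as gpd
import pyarrow.parquet as pq

def read_geo(path):
    table = pq.read_table(path)
    meta = table.schema.metadata or {}
    if b"geo" in meta:
        geo = json.loads(meta[b"geo"])          # honor the metadata
        column = geo["primary_column"]
        crs = geo["columns"][column].get("crs") or "OGC:CRS84"
        if isinstance(crs, dict):
            crs = json.dumps(crs)               # PROJJSON object -> text
    else:
        column, crs = "geometry", "OGC:CRS84"   # the proposed defaults
    df = table.to_pandas()
    geometry = gpd.GeoSeries.from_wkb(df[column], crs=crs)
    return gpd.GeoDataFrame(df.drop(columns=[column]), geometry=geometry)
```

As far as I know, geopandas.read_parquet already implements the metadata-aware branch (and errors when the metadata is missing); the fallback branch is the part this proposal would add.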