Make the Metadata Optional? #169
Replies: 7 comments 18 replies
-
I really like the idea of having sensible defaults and avoiding the "geo" file metadata altogether. There is interest in supporting different geometry encodings, and there may be concern about "locking in" WKB as the default binary encoding. I know you may be holding off on suggesting this here, but one option would be to say that if the geometry column is a string type, then the default encoding is WKT. Perhaps the initial version of the spec would not describe support for a binary encoding. That could leave more time to discover what might make for an optimal binary encoding, and a future version of the spec could describe the default in cases where the geometry column is binary without breaking compatibility.
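To make that concrete, here is a rough sketch of how a reader could pick a default encoding purely from the column's physical type. pyarrow and a conventionally named "geometry" column are assumptions for illustration, not anything the spec currently mandates:

```python
# Sketch only: pick a default geometry encoding from the column's physical
# type when no "geo" metadata is present. pyarrow and the column name
# "geometry" are assumptions here, not something the spec requires.
import pyarrow.parquet as pq
import pyarrow.types as pat

def default_encoding(path, column="geometry"):
    field_type = pq.read_schema(path).field(column).type
    if pat.is_string(field_type) or pat.is_large_string(field_type):
        return "WKT"  # string column -> assume WKT under this proposal
    if pat.is_binary(field_type) or pat.is_large_binary(field_type):
        return "WKB"  # binary column -> assume WKB
    raise ValueError(f"unexpected type for column {column!r}: {field_type}")
```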
-
Provided I'm thinking of this correctly, this conflates having sensible defaults when writing with a lack of support for reading metadata, which may ultimately harm the goals here. It is one thing to allow reading a GeoParquet file that has no metadata, in which case sensible defaults make sense. We tried to achieve some degree of consensus around sensible defaults and a relative minimum of absolutely required fields, and we can certainly try to figure out how to reduce those further while preserving the overall goals. It is another thing entirely to allow reading a GeoParquet file that has metadata when reading that metadata is not supported by a particular engine. That means that as soon as someone creates a file that does not follow the defaults, we risk misinterpretation by the reader, potentially with bad results (e.g., a projected CRS). We elaborated in other issues on use cases where it is important not to force reprojection to the default CRS, as that harms internal workflows (which may span separate engines, e.g., Python and R), even if we have guidelines that encourage (but do not require) using the default CRS when sharing publicly. Meaning: there are, and will continue to be, use cases where we need the metadata in place to read these files correctly.

I am of course assuming that if we're talking about supporting writes by engines that do not support Parquet metadata, we're also talking about supporting reads by those same engines. Maybe that's not a thing? Some of our earlier thinking was that GeoParquet is a portable file format between different engines that have their own internal geometry support, but maybe now we're also talking about use cases where incoming geometry data is simply stored and later retrieved (like other attributes) and not used directly; i.e., I/O with engines that do not themselves deal in geometries. The idea of producers that are not themselves consumers is an interesting one for us to dig into. It would be helpful to know more about this use case, especially around the read side of things.

Could we be strict and say that all readers must be able to read GeoParquet metadata at least so far as to determine that no metadata is present (in which case, assume the defaults)? But if GeoParquet metadata is marketed more strongly as optional, there is always the risk that readers simply won't implement it either. My big hangup is the distinction between a file that has metadata which goes unused because you don't know it is there, and a file that has no metadata, in which case the assumption is that everything can be read at the default values. If the convention is to enable use without reading metadata, then all you can support are the defaults. Full stop.

Geometry encodings are a separate concept to discuss; here the focus should be on whether there is a sensible default that can be assumed without storing it in metadata. We wanted WKB to be required in order to allow specifying GeoArrow (or something else) at a later date, but we could instead make the encoding optional with a default value of WKB, with the rule that if GeoArrow is used it must be set in the metadata and the reader / writer must support that metadata. Also, based on the implementations so far, I'd suggest that the default be WKB.

Briefly: WKT seems counter to the goals here, which are a portable, as-high-performance-as-possible encoding within a high-performance, binary, column-oriented container. WKT is bigger to store and slower to parse. Human readability should not be a goal for this encoding for storage / transmission. For humans, this is a display problem: your engine needs to decode WKB and show WKT when previewing query results or the like, which is what it would need to do anyway if it had an internal representation for geometry objects. I.e., it is the same problem as storing image bytes in a column, right? In a separate issue, so that we can discuss it specifically, I'd be interested to hear more about the use cases where WKT might make sense and not run counter to the goals around speed and size.

Using the data type of an inferred geometry column (e.g., "geometry") to infer its encoding without reading metadata, e.g., WKB vs. WKT, seems like a slippery slope. It isn't too many steps beyond that to say that if it is a string and the first character is "{", assume it is a GeoJSON geometry object. And then we have a mess...
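As a rough sanity check on the size point above, compare the two encodings for a moderately long linestring. shapely is used purely for illustration here; any geometry library would do:

```python
# Rough illustration of the WKB-vs-WKT size point above, using shapely
# (an assumption for the sketch; any geometry library would do).
from shapely.geometry import LineString

# A 1000-vertex linestring with lon/lat-like coordinates.
line = LineString([(-122.0 + i * 0.000123, 47.0 + i * 0.000456)
                   for i in range(1000)])

wkb_size = len(line.wkb)           # header plus two 8-byte doubles per vertex
wkt_size = len(line.wkt.encode())  # every coordinate spelled out as text

print(wkb_size, wkt_size)
# WKB is about 16 KB here; the WKT text is noticeably larger and has to be
# re-parsed back into floats on read.
```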
-
Just some thoughts... As far as I understand, the aim is to use sensible defaults so that no metadata needs to be added. So I assume that once there's a single field for which we can't find a sensible default, the main purpose of the request is not fulfilled, right? Looking at the fields, the optional ones I assume have sensible defaults, so the focus should likely be on the required fields.
The most difficult one seems to be the version number. The other question is: is this really the right approach? It feels like this may end up in a situation where files with no metadata are generated but often don't follow the defaults, which means you may end up with clients that need to "detect" the right values. The alternative approach is to give users simple tools to "convert" their exports from Presto/Trino/Athena into proper GeoParquet: a simple CLI tool where you just pass in the exported files and maybe one or two options, and it adds the metadata. I could imagine that this may be the better approach, at least from a specification perspective.
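Such a converter could be little more than the following sketch. pyarrow and a WKB column named "geometry" are assumptions; the keys mirror the published "geo" schema, but treat the version string and defaults here as placeholders rather than anything normative:

```python
# Sketch of a minimal "attach the metadata after the fact" tool, assuming
# pyarrow and a WKB column named "geometry". The exact required keys and
# version string should be taken from the spec, not from this example.
import json
import pyarrow.parquet as pq

def add_geo_metadata(src, dst, column="geometry", crs=None):
    table = pq.read_table(src)
    geo = {
        "version": "1.0.0",            # assumption: current spec version
        "primary_column": column,
        "columns": {column: {"encoding": "WKB", "geometry_types": []}},
    }
    if crs is not None:
        geo["columns"][column]["crs"] = crs   # e.g. a PROJJSON object
    metadata = dict(table.schema.metadata or {})
    metadata[b"geo"] = json.dumps(geo).encode("utf-8")
    pq.write_table(table.replace_schema_metadata(metadata), dst)

# e.g. add_geo_metadata("athena_export.parquet", "compliant.parquet")
```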
-
To me there are two competing use cases:
I think this discussion is tailored primarily towards the first case? I'm wondering if that might be solved by having GeoParquet client APIs that allow a user to pass in metadata that they know to be true.
On the latter, I think it's really hard to handle source data without metadata. Taking GeoJSON as an example, it's not trivial (at least without parsing the content into a data structure) to check whether an input file is actually GeoJSON. And on the versioning front, given that GeoJSON doesn't carry a version, it's impossible for a reader to know which version of GeoJSON is at hand. This isn't entirely abstract, either; GDAL still writes the original GeoJSON spec by default! (As described here, you have to pass an explicit option to opt in to the newer spec.)
In any case, my hunch is that sniffing an input Parquet file to see if it's GeoParquet would be neither easy nor reliable.
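To illustrate how heuristic that sniffing would get, here is the kind of guesswork a reader would be reduced to with no metadata at all. pyarrow and shapely are assumed for the sketch:

```python
# Sketch of what "sniffing" a metadata-less file would involve (pyarrow and
# shapely are assumptions). Note how quickly this becomes heuristic: a binary
# value that happens to parse as WKB, or a string starting with "{", proves
# very little about the rest of the file.
import pyarrow.parquet as pq
from shapely import wkb

def looks_like_geoparquet(path, column="geometry"):
    table = pq.read_table(path, columns=[column])
    sample = table.column(column)[0].as_py()
    if isinstance(sample, bytes):
        try:
            wkb.loads(sample)   # parses, but so might unrelated binary blobs
            return True
        except Exception:
            return False
    if isinstance(sample, str):
        return sample.lstrip().startswith(("POINT", "LINESTRING", "POLYGON", "{"))
    return False
```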
-
There are some very valid assumptions showing up here that we've found don't hold for users of what are effectively managed tabular query engines (i.e., Presto/Trino/Athena). Rather than manifesting as libraries/tools that can be upgraded at will by end users with geospatial needs, these managed tabular query engines are intended to be much more general-purpose and sit atop an extensive pile of dependencies. "Files" don't exist; instead, they manifest as "tables" with a limited set of column types (effectively deriving from the list of Hive Data Types, for better or for worse). Parquet is but one serialization of many, with storage/access optimizations visible to the reader/writer dependency and erased by the contract between the storage and query layers (in the name of "predicate push-down"). GeoParquet seems to have been designed with some assumptions that are difficult to square with the above (please correct me if I'm misrepresenting):
In Big Data land (which I started to inhabit when creating ORC files for OSM that could be queried using Amazon Athena), I think the equivalent points are:
I'm not sure how best to reconcile these, or whether that's even possible at this stage. I do know that even without access to the metadata (or with externalized metadata), the consistency that GeoParquet has encouraged has made it easier to work with geospatial data in these types of Big Data query engines. Finally, a question for the group: are "Big Data end-users" a target persona for GeoParquet? Personally, I think they should be, but we have some work to do to get there.
-
Thanks for all the great discussion everyone, and @jwass for bringing it up. We'll discuss this (and #170) synchronously at the bi-weekly GeoParquet call. Next one is May 8th, at this time (10 am my time, but use the link to convert it to your time). Email me ( at planet dot com - same as my github user name) if you'd like to join - all are welcome.
-
There were a couple of scenarios that came to mind in today's discussion around the ideas of GeoParquet-compatible vs. truly GeoParquet-compliant. If there is just one step between a data producer and a GeoParquet-compatible engine that can produce valid GeoParquet, and anything that is non-default is known and can be passed to that engine, then there is not too much chance of data misinterpretation:
In this case, you can produce a Parquet file that has just enough information that a GeoParquet-compatible reader could read it based on the defaults (after first checking that there is no GeoParquet metadata) and either use that data internally or write out a compliant GeoParquet file. So we'd be expanding the ecosystem that could produce data that could then be consumed as GeoParquet-compatible files. Great! But there's perhaps a more pernicious case that might lead to a much greater risk of data misinterpretation / data loss: involving GeoParquet-compatible (but not compliant) or even simply Parquet-compatible intermediate engines to do some non-geospatial filtering or other transformation step.
Now, if the original non-default information (e.g., an arbitrary planar CRS) stored in the GeoParquet file at the beginning of this chain is known at the end, no problem; you just reattach it by passing parameters to your GeoParquet-compatible reader and can produce a valid GeoParquet file that still has all the information. BUT if you no longer have that information, the intermediate step has discarded information, and if your final step assumes the defaults when reading the GeoParquet-compatible data, you've now misinterpreted it. To be clear: the intermediate is not claiming to support GeoParquet. What is tricky is that the intermediate step was performed by a non-geospatial engine; it shouldn't have to know what to do with geometry data (i.e., it doesn't decode / encode geometries), so we perhaps can't insist that it be a GeoParquet-compliant engine; it can just write a subset of the records containing WKB plus other attributes out to a Parquet file. From that transformation step onward, the existence of the GeoParquet metadata was not even known or knowable without the original. You don't know what you don't know. This is a problem if the data producer at the beginning of this chain is not the data consumer at the end of it (if you are, you know all the things). The point I'm trying to make is that this is a different situation yet again from the minimum requirement we set for GeoParquet-compliant readers (i.e., they have to at least check for the presence of metadata, have some ability to interpret it according to the spec, and be able to assume defaults or take parameters to fill those in). I'm not sure what we can do in this case other than to recommend not using Parquet-only intermediate steps that start with GeoParquet unless you can capture or control enough of that chain end-to-end.
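For the recoverable case, the "reattach what you know" step might look roughly like the sketch below. geopandas/pyarrow and an out-of-band known CRS are assumptions; without that knowledge there is nothing to reattach and a reader assuming the defaults silently misinterprets the data:

```python
# Sketch of reattaching known, non-default information after a Parquet-only
# intermediate step has stripped the "geo" metadata. Assumes geopandas and
# pyarrow, and that the original CRS (e.g. "EPSG:3857") is known out-of-band.
import geopandas as gpd
import pyarrow.parquet as pq

def restore_geoparquet(plain_parquet, out_path, crs, column="geometry"):
    df = pq.read_table(plain_parquet).to_pandas()
    geometry = gpd.GeoSeries.from_wkb(df[column], crs=crs)
    gdf = gpd.GeoDataFrame(df.drop(columns=[column]), geometry=geometry)
    gdf.to_parquet(out_path)   # writes the "geo" metadata again

# e.g. restore_geoparquet("filtered.parquet", "restored.parquet", "EPSG:3857")
```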
-
Hello - after using some existing Parquet tooling to try to write GeoParquet, one source of friction I (and some colleagues) have encountered is that many tools don't normally expose writing or accessing the Parquet metadata. For example, Presto/Trino/Athena can write Parquet files as the result of a SQL query but don't have a way to specify the metadata. Similarly, other tools I've used that convert tabular data into Parquet or other formats don't expose a path to write metadata out, because they treat Parquet files as just simple tables. I think this would be true for DuckDB too, though I'm not 100% sure.
I'm wondering if the metadata is really needed at all? An analogy is that GeoJSON files don't have any special marking to indicate they're GeoJSON; they just happen to have the right fields/structure. Can GeoParquet do the same? In other words, you could say something like "a Parquet file is a GeoParquet file if it has a column called 'geometry' with WKB-formatted data".
More concretely, I think the spec would change along these lines:
If the spec could support this, there are tons of tools today that could just write GeoParquet without any additional/special spatial awareness.
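A reader following the "check for metadata first, then fall back to defaults" behavior discussed above might look roughly like this. pyarrow/geopandas and the specific default values are assumptions for illustration:

```python
# Sketch of the reader behavior discussed in this thread: honor "geo" metadata
# when present, and only fall back to proposed defaults (a WKB column named
# "geometry", CRS OGC:CRS84) when it is genuinely absent. The libraries and
# default values here are assumptions, not the spec.
import json
import geopandas as gpd
import pyarrow.parquet as pq

def read_geo(path):
    table = pq.read_table(path)
    meta = table.schema.metadata or {}
    if b"geo" in meta:
        geo = json.loads(meta[b"geo"])          # honor the metadata
        column = geo["primary_column"]
        crs = geo["columns"][column].get("crs") or "OGC:CRS84"
        if isinstance(crs, dict):
            crs = json.dumps(crs)               # PROJJSON object -> text
    else:
        column, crs = "geometry", "OGC:CRS84"   # the proposed defaults
    df = table.to_pandas()
    geometry = gpd.GeoSeries.from_wkb(df[column], crs=crs)
    return gpd.GeoDataFrame(df.drop(columns=[column]), geometry=geometry)
```

As far as I know, geopandas.read_parquet already implements the metadata-aware branch (and errors when the metadata is missing); the fallback branch is the part this proposal would add.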