Item and Collection parquet Models #25

vincentsarago · 2023-09-19T02:57:20Z

vincentsarago
Sep 19, 2023

This should note be considered as an Issue but as a Discussion (but not enabled in this repo yet)

👋 @TomAugspurger , Thanks for starting this tool. I'm personally interested in stac-geoparquet to create easily shareable files for large STAC Collections. My usual way of doing is to create NewLine delimited GeoJSON (https://github.com/vincentsarago/MAXAR_opendata_to_pgstac) but GeoParquet seems to be a nice alternative and will also provide some simple Query capacity.

I've looked at the code and implemented a simplified version of a STAC to GeoParquet function. I say simplified because I really tried to minimize the data model, mostly by not creating column for properties properties.

from typing import Any, Dict

import json
import geopandas as gpd
import pandas as pd
import shapely


# From https://github.com/stac-utils/stac-geoparquet/blob/main/stac_geoparquet/utils.py#L60-L72
def fix_empty_multipolygon(
    item_geometry: Dict[str, Any]
) -> shapely.geometry.base.BaseGeometry:
    # Filter out missing geoms in MultiPolygons
    # https://github.com/shapely/shapely/issues/1407
    # geometry = [shapely.geometry.shape(x["geometry"]) for x in items2]
    if item_geometry["type"] == "MultiPolygon":
        item_geometry = dict(item_geometry)
        item_geometry["coordinates"] = [
            x for x in item_geometry["coordinates"] if any(x)
        ]

    return shapely.geometry.shape(item_geometry)


    with open("items_s3.json", "r") as f:
    items = [json.loads(line.rstrip()) for line in f.readlines()]

# Create A GeoParquet
geometry = []
for item in items:
    # Create geometries
    geometry.append(fix_empty_multipolygon(item.pop("geometry")))

    # Create `datetime`/`start_datetime`/`end_datetime` columns
    item["datetime"] = item["properties"]["datetime"]
    item["start_datetime"] = item["properties"].get("start_datetime")
    item["end_datetime"] = item["properties"].get("end_datetime")
    
    # TODO: Maybe split the bbox in minx/miny/maxx/maxY columns

    # Convert Dict and List to String
    for k, v in item.items():
        if isinstance(v, (dict, list)):
            item[k] = json.dumps(v)


gdf = gpd.GeoDataFrame(items, geometry=geometry, crs="WGS84")   
gdf.to_parquet("items.parquet")

In ☝️ I'm creating columns for each STAC object properties (not the item properties) and creating columns for the datetimes properties (to ease temporal filtering). But then I'm creating string for all the List and Dict object.

Model:

Type	stac_version	id	properties	datetime	start_datetime	end_datetime	links	assets	bbox	stac_extensions	collection	geometry
str	str	str	JSON str	date	date	date	JSON str	JSON str	JSON str	JSON str	str	geom

I do the same for the collection

with open("collections.json", "r") as f:
    collections = [json.loads(line.rstrip()) for line in f.readlines()]

# Create A GeoParquet
geometry = []
for collection in collections:
    # Create geometries from extent
    geometry.append(
        # The extent is in form of `global bbox`, `small bbox1`, `small bbox 2` , ....
        shapely.geometry.GeometryCollection(
            [
                shapely.geometry.Polygon.from_bounds(*bbox)
                for bbox in collection["extent"]["spatial"]["bbox"]
            ]
        )
    )

    interval = collection["extent"]["temporal"]["interval"]
    # Create `start_datetime`/`end_datetime` columns
    collection["start_datetime"] = collection["extent"]["temporal"]["interval"][0][0] if interval else None
    collection["end_datetime"] = collection["extent"]["temporal"]["interval"][0][1] if interval else None
    
    # Convert Dict to String
    for k, v in collection.items():
        if isinstance(v, (dict, list)):
            collection[k] = json.dumps(v)


gdf = gpd.GeoDataFrame(collections, geometry=geometry, crs="WGS84")
gdf.to_parquet("collections.parquet")

cc @kylebarron @gadomski

TomAugspurger · 2023-09-19T14:01:13Z

TomAugspurger
Sep 19, 2023
Maintainer

Can you expand a bit on the reason for JSON encoding the fields like links, assets, properties, etc.?

Working with these nested fields can be a pain, but it seems to me like the tooling around these nested dtypes is improving (pandas-dev/pandas#54938, etc.). Being able to filter on, e.g. .properties.eo:cloud_cover at the parquet level seems pretty useful.

2 replies

vincentsarago Sep 20, 2023
Author

Can you expand a bit on the reason for JSON encoding the fields like links, assets, properties, etc.?

I think it's mostly that I was getting error for Object type (but don't remember exactly what 😓)

TomAugspurger Sep 20, 2023
Maintainer

Those have been a pain to try and handle. Hopefully if we use Arrow for the storage in-memory we'll be able to avoid all those issues.

TomAugspurger · 2023-09-19T14:07:54Z

TomAugspurger
Sep 19, 2023
Maintainer

BTW: I really like the idea of defining a data model (or multiple?), maybe as a pyarrow or parquet schema, for this type of data, and properly documenting it.

https://arrow.apache.org/blog/2023/04/11/our-journey-at-f5-with-apache-arrow-part-1/ and https://arrow.apache.org/blog/2023/06/26/our-journey-at-f5-with-apache-arrow-part-2/ are some pretty detailed blog posts on making a data model for OpenTelemetry data.

0 replies

kylebarron · 2023-09-19T14:30:41Z

kylebarron
Sep 19, 2023
Maintainer

BTW: I really like the idea of defining a data model and properly documenting it.

💯 there's a STAC Sprint next week that I think this would fit into well. I'm a fan of having arrow-native/parquet-native types for stac-geoparquet. Maybe I'll try to write up a "mini spec" for this with a read and write implementation using pyarrow in python? Maybe as a PR here? I think using pyarrow directly is likely to have a lot better control over the exact representation, and especially should make it easier to dictionary-encode specific string columns, which should save a ton of memory

1 reply

gadomski Sep 19, 2023
Maintainer

Yup, we chatted about this briefly yesterday during a planning call for the sprint, but a "mini-spec" for table storage would be welcomed. It could complement an API extension for (e.g.) GeoParquet responses.

TomAugspurger · 2023-09-19T14:41:53Z

TomAugspurger
Sep 19, 2023
Maintainer

I think using pyarrow directly is likely to have a lot better control over the exact representation, and especially should make it easier to dictionary-encode specific string columns, which should save a ton of memory

+1 to using arrow directly (and then somehow adding the geoarrow metadata.). There's a few places where we have to fixup issues with object-dtype ndarrays we're getting from pandas.

If we do need geopandas for anything, then we can explore pandas' new-ish support for arrow-backed arrays.

1 reply

kylebarron Sep 19, 2023
Maintainer

I did a little testing and for the repeated string columns we can reduce memory usage by a lot. I'm testing with 1000 sentinel-cogs STAC items and on the stac-extensions field, dictionary-encoding the strings reduces memory use by 60x from 440498 bytes to 7439. The type and stac_version columns have a 10x reduction.

It would probably be easiest to let the user (optionally) inform the function which types should be dictionary encoded, since it's somewhat hard to infer that accurately for the properties.

There are also other areas where storing in Arrow give additional information. For example, if you store the stac_version dictionary-encoded, you can quickly see whether all of the STAC items are at 1.0.0 or if some are at 1.0.0-beta.2.

gadomski · 2023-09-19T14:44:33Z

gadomski
Sep 19, 2023
Maintainer

This should note be considered as an Issue but as a Discussion (but not enabled in this repo yet)

I've enabled discussions and moved this over.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Item and Collection parquet Models #25

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 5 comments 4 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Item and Collection parquet Models #25

vincentsarago Sep 19, 2023

Replies: 5 comments · 4 replies

TomAugspurger Sep 19, 2023 Maintainer

vincentsarago Sep 20, 2023 Author

TomAugspurger Sep 20, 2023 Maintainer

TomAugspurger Sep 19, 2023 Maintainer

kylebarron Sep 19, 2023 Maintainer

gadomski Sep 19, 2023 Maintainer

TomAugspurger Sep 19, 2023 Maintainer

kylebarron Sep 19, 2023 Maintainer

gadomski Sep 19, 2023 Maintainer

vincentsarago
Sep 19, 2023

Replies: 5 comments 4 replies

TomAugspurger
Sep 19, 2023
Maintainer

vincentsarago Sep 20, 2023
Author

TomAugspurger Sep 20, 2023
Maintainer

TomAugspurger
Sep 19, 2023
Maintainer

kylebarron
Sep 19, 2023
Maintainer

gadomski Sep 19, 2023
Maintainer

TomAugspurger
Sep 19, 2023
Maintainer

kylebarron Sep 19, 2023
Maintainer

gadomski
Sep 19, 2023
Maintainer