A couple of quick thoughts:
In the meantime, I think your best bet is to filter the filenames based on the date range you're interested in (see the sketch after the example below). I agree it's ugly, and it might break in the future if we change the naming scheme, but it's the best we have for now to filter out partitions. Once you've figured out which partitions you want, you'll want to apply a row filter as well:

```python
import dask_geopandas

s2l2a = dask_geopandas.read_parquet(
    asset.href,
    storage_options=asset.extra_fields["table:storage_options"],
    gather_spatial_partitions=False,  # skip reading spatial partitioning metadata
    filters=[("eo:cloud_cover", "<", 10)],  # row-level filter on cloud cover
)
s2l2a.head()
```

Then you can do your clip / mask based on your polygon. Ideally, by that point each partition will fit in memory.
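As a rough illustration of the filename-filtering step, something like this could work. It's only a sketch: it assumes the partition filenames contain a parseable date (which, as noted, isn't guaranteed to stay that way), and `parse_partition_date` is a hypothetical helper you'd adapt to the actual names:

```python
import re
from datetime import date

import fsspec


def parse_partition_date(path):
    # Hypothetical helper: pull the first YYYY-MM-DD out of a partition
    # filename. Adapt the pattern to the actual (unstable) naming scheme.
    match = re.search(r"(\d{4})-(\d{2})-(\d{2})", path)
    return date(*map(int, match.groups())) if match else None


# List the individual partition files inside the parquet dataset.
fs = fsspec.filesystem("abfs", **asset.extra_fields["table:storage_options"])
paths = fs.ls(asset.href, detail=False)

start, end = date(2015, 1, 20), date(2015, 2, 10)
selected = [
    f"abfs://{p}"
    for p in paths
    if (d := parse_partition_date(p)) is not None and start <= d <= end
]
```

You could then pass `selected` to `dask_geopandas.read_parquet` in place of the dataset root, so only the matching partitions are read at all.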
---
Hi Planetary Computer team,
I love the geoparquet interface that the datasets provide, as it lets me filter the STAC data efficiently in my existing solution for the NAIP dataset.
Building on that, I wanted to implement the same workflow for Sentinel-2 as well, but since Sentinel-2 has far more items, its parquet files are partitioned.
Generally what I do for NAIP is roughly the following (a simplified sketch: `aoi_polygon`, the `geoparquet-items` asset key, and the exact column names stand in for my real code):
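```python
import geopandas as gpd
import planetary_computer
import pystac_client

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)
# NAIP ships its STAC items as a single geoparquet file,
# so plain geopandas is enough.
asset = catalog.get_collection("naip").assets["geoparquet-items"]
naip = gpd.read_parquet(
    asset.href, storage_options=asset.extra_fields["table:storage_options"]
)
# Filter by time, then intersect with my area of interest.
naip = naip[naip["datetime"].between("2015-01-20", "2015-02-10")]
items = naip[naip.intersects(aoi_polygon)]
```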
For Sentinel-2, however, I load the parquet with dask-geopandas, along these lines (again a sketch):
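```python
import dask_geopandas

# Sentinel-2 L2A items are split across many parquet partitions.
asset = catalog.get_collection("sentinel-2-l2a").assets["geoparquet-items"]
dgdf = dask_geopandas.read_parquet(
    asset.href, storage_options=asset.extra_fields["table:storage_options"]
)
```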
This yields a dask GeoDataFrame (dgdf). Intersecting it with my area of interest (a polygon) is not possible without computing the dgdf first. What I want to achieve is to filter the dgdf down by my time range and cloud coverage (ideally also select certain bands), then compute the dataframe and do my intersection, along these lines:
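```python
# Lazy row filters on time range and cloud cover (column names assumed).
filtered = dgdf[
    (dgdf["datetime"] >= "2015-01-20")
    & (dgdf["datetime"] <= "2015-02-10")
    & (dgdf["eo:cloud_cover"] < 10)
]
# Only now materialize, then intersect with my polygon.
gdf = filtered.compute()
items = gdf[gdf.intersects(aoi_polygon)]
```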
Currently, my understanding is that there are two ways to filter for time:
Is there any option or workflow to reduce the dask-geodataframe in an easy way, so that I end up with just my items of interest?
My suggestion would be for the Planetary Computer geoparquet datasets to ship with known divisions (currently dgdf.known_divisions is False), so that I can slice the dask-geodataframe efficiently with dgdf.loc['2015-01-20':'2015-02-10'] without computing it. A sketch of what that would enable follows below.
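For illustration (a sketch; doing set_index client-side forces an expensive shuffle over all partitions, which is exactly what shipping divisions with the dataset would avoid):

```python
# Building a sorted datetime index makes divisions known, but requires
# a shuffle when done client-side.
dgdf = dgdf.set_index("datetime")
assert dgdf.known_divisions
subset = dgdf.loc["2015-01-20":"2015-02-10"]  # lazy, no full compute
```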
If you have any other suggestion or workflow that I might have missed, please let me know.