A couple of quick thoughts:
In the meantime, I think your best bet is to filter the filenames based on the date range you're interested in (see the sketch after the example below). I agree it's ugly, and it might break in the future if we change the naming scheme, but it's the best we have for now to filter out partitions. Once you've figured out which partitions you want, you'll want to apply a row filter as well:

```python
import dask_geopandas

s2l2a = dask_geopandas.read_parquet(
    asset.href,
    storage_options=asset.extra_fields["table:storage_options"],
    gather_spatial_partitions=False,  # skip reading spatial partitioning metadata
    filters=[("eo:cloud_cover", "<", 10)],  # row-level filter on cloud cover
)
s2l2a.head()
```

Then you can do your clip / mask based on your polygon. Ideally, by that point each partition will fit in memory.
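As a rough illustration of the filename-filtering step, something like this could work. It's only a sketch: it assumes the partition filenames contain a parseable date (which, as noted, isn't guaranteed to stay that way), and `parse_partition_date` is a hypothetical helper you'd adapt to the actual names:

```python
import re
from datetime import date

import fsspec


def parse_partition_date(path):
    # Hypothetical helper: pull the first YYYY-MM-DD out of a partition
    # filename. Adapt the pattern to the actual (unstable) naming scheme.
    match = re.search(r"(\d{4})-(\d{2})-(\d{2})", path)
    return date(*map(int, match.groups())) if match else None


# List the individual partition files inside the parquet dataset.
fs = fsspec.filesystem("abfs", **asset.extra_fields["table:storage_options"])
paths = fs.ls(asset.href, detail=False)

start, end = date(2015, 1, 20), date(2015, 2, 10)
selected = [
    f"abfs://{p}"
    for p in paths
    if (d := parse_partition_date(p)) is not None and start <= d <= end
]
```

You could then pass `selected` to `dask_geopandas.read_parquet` in place of the dataset root, so only the matching partitions are read at all.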
---
Hi Planetary Computer team,
I love the geoparquet interface that the datasets provide, as it lets me filter the STAC data efficiently in my existing solution for the NAIP dataset.
Building on that, I wanted to implement the same workflow for Sentinel-2 as well, but since Sentinel-2 has far more items, its parquet files are partitioned.
Generally what I do for NAIP is roughly the following (a simplified sketch: `aoi_polygon`, the `geoparquet-items` asset key, and the exact column names stand in for my real code):
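```python
import geopandas as gpd
import planetary_computer
import pystac_client

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)
# NAIP ships its STAC items as a single geoparquet file,
# so plain geopandas is enough.
asset = catalog.get_collection("naip").assets["geoparquet-items"]
naip = gpd.read_parquet(
    asset.href, storage_options=asset.extra_fields["table:storage_options"]
)
# Filter by time, then intersect with my area of interest.
naip = naip[naip["datetime"].between("2015-01-20", "2015-02-10")]
items = naip[naip.intersects(aoi_polygon)]
```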
For Sentinel-2, however, I load the parquet with dask-geopandas, along these lines (again a sketch):
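```python
import dask_geopandas

# Sentinel-2 L2A items are split across many parquet partitions.
asset = catalog.get_collection("sentinel-2-l2a").assets["geoparquet-items"]
dgdf = dask_geopandas.read_parquet(
    asset.href, storage_options=asset.extra_fields["table:storage_options"]
)
```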
This yields a dask GeoDataFrame (dgdf). Intersecting it with my area of interest (a polygon) is not possible without computing the dgdf first. What I want to achieve is to filter the dgdf down by my time range and cloud coverage (ideally also select certain bands), then compute the dataframe and do my intersection, along these lines:
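```python
# Lazy row filters on time range and cloud cover (column names assumed).
filtered = dgdf[
    (dgdf["datetime"] >= "2015-01-20")
    & (dgdf["datetime"] <= "2015-02-10")
    & (dgdf["eo:cloud_cover"] < 10)
]
# Only now materialize, then intersect with my polygon.
gdf = filtered.compute()
items = gdf[gdf.intersects(aoi_polygon)]
```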
Currently, my understanding is that there are two ways to filter for time:
Is there any option or workflow to reduce the dask-geodataframe in an easy way, so that I end up with just my items of interest?
My suggestion would be for the Planetary Computer geoparquet datasets to ship with known divisions (currently dgdf.known_divisions is False), so that I can slice the dask-geodataframe efficiently with dgdf.loc['2015-01-20':'2015-02-10'] without computing it. A sketch of what that would enable follows below.
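For illustration (a sketch; doing set_index client-side forces an expensive shuffle over all partitions, which is exactly what shipping divisions with the dataset would avoid):

```python
# Building a sorted datetime index makes divisions known, but requires
# a shuffle when done client-side.
dgdf = dgdf.set_index("datetime")
assert dgdf.known_divisions
subset = dgdf.loc["2015-01-20":"2015-02-10"]  # lazy, no full compute
```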
If you have any other suggestion or workflow that I might have missed, please let me know.