Load collection-level assets into xarray #90
Thanks @TomAugspurger !
Seems reasonable to me!
Makes sense. Maybe work here could resolve #59
Not really, but definitely have a look at #75 and in particular proposed changes for potentially deviating from the intake-xarray dependency to have drivers defined with intake-stac (https://github.com/intake/intake-stac/pull/75/files#diff-b45fa0c9c70f45ce9661f18946a5a2aed632ac4c1d3b1c09333291f77bbdfda6). For the specific case of Also, just want to note this PR addresses #59 |
I pushed an update so this is a bit simpler. I see that My main question now is around what to call this method. It's really doing two things:
I've called this |
Having some second thoughts about the API design around selecting an asset, and I wonder if anyone else has thoughts. We can't use Then the question is: do we have a separate method to get an asset, followed by a

    my_asset = my_collection.get_asset(asset_key)  # type: StacAsset
    ds = my_asset.to_dask()  # type: xarray.DataArray, dask.dataframe.DataFrame, etc.

or do we put the

    ds = my_asset.to_dask(asset_key)  # type: xarray.DataArray, dask.dataframe.DataFrame, etc.

I suppose that the first option, a |
Thinking about this more, I think at least one more STAC extension is appropriate to capture this information. These would extend the STAC collection and Item. I want to capture everything necessary to go from STAC Asset to xarray Dataset within the STAC catalog itself. Essentially,

    asset = stac_catalog.assets[key]
    store = fsspec.get_mapper(asset.href, **storage_options)
    ds = xr.open_zarr(store, **xarray_open_kwargs)

So there are two pieces of information to capture:

We could have two new extensions:

    "zarr-abfs": {
        "href": "abfs://daymet-zarr/daily/hi.zarr",
        "type": "application/vnd+zarr",
        "title": "Daily Hawaii Daymet Azure Blob File System Zarr root",
        "description": "Azure Blob File System of the daily Hawaii Daymet Zarr Group on Azure Blob Storage for use with adlfs.",
        "roles": [
            "data",
            "zarr",
            "abfs"
        ],
        "xarray:storage_options": {
            "account_name": "daymeteuwest"
        },
        "xarray:open_kwargs": {
            "consolidated": true
        }
    },
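As a rough illustration of the idea above, the sketch below pulls the two proposed fields out of an asset represented as a plain dictionary. The helper name `open_args_from_asset` is hypothetical and not part of intake-stac or pystac; the field names and values are taken from the asset example in this comment.

```python
from typing import Any, Dict, Tuple


def open_args_from_asset(
    asset: Dict[str, Any],
) -> Tuple[str, Dict[str, Any], Dict[str, Any]]:
    """Return (href, storage_options, open_kwargs) for a STAC asset dict.

    Hypothetical helper: missing extension fields default to empty dicts.
    """
    href = asset["href"]
    storage_options = dict(asset.get("xarray:storage_options", {}))
    open_kwargs = dict(asset.get("xarray:open_kwargs", {}))
    return href, storage_options, open_kwargs


asset = {
    "href": "abfs://daymet-zarr/daily/hi.zarr",
    "type": "application/vnd+zarr",
    "xarray:storage_options": {"account_name": "daymeteuwest"},
    "xarray:open_kwargs": {"consolidated": True},
}
href, storage_options, open_kwargs = open_args_from_asset(asset)

# With fsspec, adlfs, and xarray installed, the values would then be used
# as in the snippet above (not executed here because it needs network access):
# store = fsspec.get_mapper(href, **storage_options)
# ds = xr.open_zarr(store, **open_kwargs)
```

Keeping the extraction separate from the actual open call means the catalog metadata can be inspected or overridden before any remote storage is touched.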
See https://github.com/tomAugspurger/xarray-assets for a proposal. I don't really know how valuable that is, but I think it's worth exploring a bit. If that extension is present, then I think intake-stac could use it like https://github.com/tomAugspurger/xarray-assets#python-example to safely go from a STAC Asset -> xarray.Dataset without any arguments from the user. |
    if isinstance(result, DataSource):
        kwargs = result._captured_init_kwargs
        kwargs = {**kwargs, **dict(storage_options=storage_options), **open_kwargs}
        result = result(*result._captured_init_args, **kwargs)
@martindurant currently, StacItem.__getitem__ will return a (subclass of) DataSource. Does this seem like the right way to control the parameters passed to that DataSource? If so, are _captured_init_args and _captured_init_kwargs considered "public"?
This looks essentially the same as DataSourceBase.configure_new (aliased with get for compatibility, and __call__), but yes, seems fine to me.

> are _captured_init_args and _captured_init_kwargs considered "public"

They were meant for internal storage and to be able to recreate things after serialisation, possibly to YAML. They are more "automatic" than "private", I think.

> Does this seem like the right way

Unless configure_new already does the right thing.

I do wonder what result can be if not a DataSource.
> Unless configure_new already does the right thing.

Gotcha. I think configure_new doesn't quite work, since we want to merge these keywords with the "existing" ones that are in ._captured_init_args (we had a test relying on that anyway).

I don't see an easy way for configure_new to add a keyword to control whether or not to merge the new kwargs, since it's passing all the keywords through, there's the potential for a conflict.

> I do wonder what result can be if not a DataSource.

In this case, perhaps a StacAsset, but I might be misunderstanding intake-stac's design.
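The merge behaviour discussed in this thread can be sketched without intake at all. In the sketch below, `FakeSource` is a hypothetical stand-in that only mimics the `_captured_init_args` / `_captured_init_kwargs` attributes mentioned above, and `reconfigure` is an illustrative name; the real diff calls the source object itself rather than its class.

```python
class FakeSource:
    """Hypothetical stand-in for an intake DataSource (not intake's class)."""

    def __init__(self, urlpath, storage_options=None, **kwargs):
        # Record how the object was built so it can be recreated later.
        self._captured_init_args = (urlpath,)
        self._captured_init_kwargs = {"storage_options": storage_options, **kwargs}
        self.urlpath = urlpath
        self.storage_options = storage_options
        self.kwargs = kwargs


def reconfigure(source, storage_options, open_kwargs):
    """Rebuild `source`, merging new keywords over the captured ones."""
    merged = {
        **source._captured_init_kwargs,      # existing keywords are kept
        "storage_options": storage_options,  # replaced by the new value
        **open_kwargs,                       # new keywords win on conflict
    }
    return type(source)(*source._captured_init_args, **merged)


original = FakeSource("a.zarr", chunks={})
updated = reconfigure(original, {"account_name": "x"}, {"consolidated": True})
```

Here `updated` keeps the original `chunks` keyword while gaining `consolidated` and the new storage options, which is the merge that a plain pass-through of keywords would not give.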
Just noting for posterity, intake-xarray's datasources define .kwargs and .storage_options properties. We can't use those because they apparently aren't implemented by RasterIOSource.
unfortunately i don't really follow this... i've always been a little confused about what should be handled by intake-xarray or whether intake-stac should just be stand-alone and define all the datasources under this repo. I sort of started down that road in https://github.com/intake/intake-stac/pull/75/files#diff-b45fa0c9c70f45ce9661f18946a5a2aed632ac4c1d3b1c09333291f77bbdfda6 but abandoned it...
The latest commit implements the API described in #90 (comment). So now users call

    In [2]: import intake
    In [3]: collection = intake.open_stac_collection("https://planetarycomputer.microsoft.com/api/stac/v1/collections/daymet-annual-hi")
    In [4]: source = collection.get_asset("zarr-https")
    In [6]: source.kwargs
    Out[6]: {'consolidated': True}
    In [7]: source.to_dask()
    Out[7]:
    <xarray.Dataset>
    Dimensions:                  (nv: 2, time: 41, x: 284, y: 584)
    Coordinates:
        lat                      (y, x) float32 dask.array<chunksize=(584, 284), meta=np.ndarray>
        lon                      (y, x) float32 dask.array<chunksize=(584, 284), meta=np.ndarray>
      * time                     (time) datetime64[ns] 1980-07-01T12:00:00 ... 20...
      * x                        (x) float32 -5.802e+06 -5.801e+06 ... -5.519e+06
      * y                        (y) float32 -3.9e+04 -4e+04 ... -6.21e+05 -6.22e+05
    Dimensions without coordinates: nv
    Data variables:
        lambert_conformal_conic  int16 ...
        prcp                     (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
        swe                      (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
        time_bnds                (time, nv) datetime64[ns] dask.array<chunksize=(1, 2), meta=np.ndarray>
        tmax                     (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
        tmin                     (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
        vp                       (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
    Attributes:
        Conventions:       CF-1.6
        Version_data:      Daymet Data Version 4.0
        Version_software:  Daymet Software Version 4.0
        citation:          Please see http://daymet.ornl.gov/ for current Daymet ...
        references:        Please see http://daymet.ornl.gov/ for current informa...
        source:            Daymet Software Version 4.0
        start_year:        1980
Looks like the narrative docs are a bit out of date, but f1dc6ff added a small section on xarray-assets to the docs. @kthyng did you already have STAC items / collections I could test this against? Or were you waiting for intake-stac to be updated before generating those? @scottyhq do you have a chance to take a look at this? |
@TomAugspurger You mean a catalog already set up to use |
That should just require adding the extension's URL to the Catalog / Item's
If you're generating STAC metadata for Zarr datasets, https://github.com/TomAugspurger/xstac might be helpful, or you can generate it "by hand". |
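For the "by hand" route, a minimal sketch of what that might look like on a STAC object held as a plain dictionary is shown below. It relies on the standard `stac_extensions` list from the STAC spec and the `xarray:open_kwargs` / `xarray:storage_options` fields shown earlier in this thread. The schema URL is a placeholder and `enable_xarray_assets` is a hypothetical helper, so both would need to be adapted to the published extension.

```python
import copy

# Placeholder only: substitute the schema URL published by the extension.
XARRAY_ASSETS_SCHEMA = "https://example.com/xarray-assets/schema.json"


def enable_xarray_assets(stac_object, asset_key, open_kwargs, storage_options=None):
    """Return a copy of `stac_object` that declares the extension on one asset."""
    updated = copy.deepcopy(stac_object)  # leave the caller's dict untouched

    # Declare the extension once in the object's extension list.
    extensions = updated.setdefault("stac_extensions", [])
    if XARRAY_ASSETS_SCHEMA not in extensions:
        extensions.append(XARRAY_ASSETS_SCHEMA)

    # Attach the two extension fields to the chosen asset.
    asset = updated["assets"][asset_key]
    asset["xarray:open_kwargs"] = dict(open_kwargs)
    asset["xarray:storage_options"] = dict(storage_options or {})
    return updated


item = {
    "type": "Feature",
    "id": "example-item",
    "stac_extensions": [],
    "assets": {
        "data": {"href": "https://example.com/data.zarr", "type": "application/vnd+zarr"}
    },
}
updated_item = enable_xarray_assets(item, "data", {"consolidated": True})
```

Working on a copy keeps the original metadata intact, which is convenient when the same item is written out both with and without the extension.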
@@ -230,6 +283,20 @@ class StacItem(AbstractStacCatalog):
    name = 'stac_item'
    _stac_cls = pystac.Item

    def __getitem__(self, key):
        result = super().__getitem__(key)
        # TODO: handle non-string assets?
i haven't come across this in the wild. are they always strings? here for example I see asset["0"] https://cmr.earthdata.nasa.gov/stac/NSIDC_ECS/collections/NSIDC-0723.v4/items
Apparently it's possible to look up multiple items by passing a tuple to __getitem__. https://github.com/intake/intake/blob/d9faa048bbc09d74bb6972f672155e3814c3ca62/intake/catalog/base.py#L403
I haven't used it either.
Thanks for pushing this forward @TomAugspurger! I think this will be a great addition, left some comments for some minor suggested changes, then we should merge it!
@TomAugspurger Thanks for the help, that was really clear. I am meeting an issue I think due to using For code here:
|
I'm currently working with netcdf files and couldn't tell if I should be using |
Thanks @scottyhq, updated to address your comments. @kthyng I'll take a closer look later, but I think you can update

    properties['xarray_kwargs'] = {'drop_variables': 'obs'}
    item.add_asset(
        key=item.id,
        asset=pystac.Asset(**asset)
    )

Hopefully that does the trick. I haven't tried xstac on a NetCDF file yet. I'll give that a shot tonight or tomorrow and will add it as an example! |
@TomAugspurger thanks for the suggestion but unfortunately that hasn't worked for me. Specifically, it has to go into Here's what I mean at the point it gets to the intake GUI. The but I think they need to be available under "args" (second image) to be used in |
This is a prototype for loading collection-level assets from a STAC collection. If you want a full example, install the main branch of pystac:

It's not quite ready, but I have a few points of discussion:

- [...] zarr-https) in the example above. The STAC spec doesn't give those any meaning really, but they're used in other places (e.g. stackstac.stack(..., assets=[])) so I think we're OK.
- [...] xarray.open_dataset if it had that media type. Right now we're only supporting Zarr.
- [...] storage_options like consolidated=True, rather than the user. Does intake-stac do anything like that today?

Closes #59
Closes #70
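One way to picture the media-type point from the discussion list is a small dispatch on the asset's `type` field. This is only a sketch under stated assumptions: `xarray_opener_name` is a hypothetical helper, the Zarr media type is the one used in the asset example in this thread, and the NetCDF media type is an assumed value that may differ from what a given catalog uses.

```python
ZARR_MEDIA_TYPE = "application/vnd+zarr"    # as used in this thread's asset example
NETCDF_MEDIA_TYPE = "application/x-netcdf"  # assumed value, not confirmed here


def xarray_opener_name(asset):
    """Return the name of the xarray function that would open this asset."""
    media_type = asset.get("type")
    if media_type == ZARR_MEDIA_TYPE:
        return "open_zarr"
    if media_type == NETCDF_MEDIA_TYPE:
        return "open_dataset"
    raise ValueError(f"Unsupported media type: {media_type!r}")


opener = xarray_opener_name({"href": "abfs://daymet-zarr/daily/hi.zarr",
                             "type": "application/vnd+zarr"})
```

Returning the function name rather than importing xarray keeps the sketch dependency-free; a real implementation would presumably look the function up on the `xarray` module.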