[Design] Integrate xarray-datatree #2352

Open
djhoese opened this issue Jan 13, 2023 · 9 comments
Labels
component:scene documentation enhancement code enhancements, features, improvements future ideas Wishes and ideas for the future question


@djhoese
Member

djhoese commented Jan 13, 2023

Overview

A relatively new package called xarray-datatree introduces a new xarray high-level container called a DataTree. A DataTree contains xarray Datasets which themselves contain xarray DataArrays. In current Satpy our main container object is the Scene which stores xarray DataArray objects in a flat dictionary structure, but with complex DataID objects as keys that support complex dictionary lookups with DataQuery objects. It is my opinion that Satpy could benefit from using DataTree objects in various parts of Satpy. A DataTree could provide an xarray "builtin" (the expectation being that DataTrees are merged into upstream xarray) container that requires little to no special code in Satpy to work with DataArrays/Sets/Trees. This could mean getting rid of the satpy Scene where Satpy is then focused on providing helper functions and accessors on xarray objects for doing the usual tasks that the Scene is used for.
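
To make the contrast between the two container styles concrete, here is a minimal standard-library sketch. It is not Satpy code: `FakeDataID` and `query` are hypothetical, much-simplified stand-ins for DataID keys and DataQuery lookups, and the nested dict only mimics the way a DataTree nests groups and variables. The strings stand in for real DataArrays.

```python
from typing import NamedTuple


class FakeDataID(NamedTuple):
    """Hypothetical, much-simplified stand-in for a Satpy DataID."""
    name: str
    resolution: int
    calibration: str


# Flat container: every array sits at the top level under a compound key.
flat = {
    FakeDataID("C01", 1000, "reflectance"): "array-C01",
    FakeDataID("C02", 500, "reflectance"): "array-C02",
    FakeDataID("C13", 2000, "brightness_temperature"): "array-C13",
}


def query(container, name=None, resolution=None, calibration=None):
    """Return all values whose key matches every field that was given."""
    wanted = {"name": name, "resolution": resolution, "calibration": calibration}
    return [
        value
        for key, value in container.items()
        if all(v is None or getattr(key, f) == v for f, v in wanted.items())
    ]


# Tree-like container: group -> variable name -> array.
tree = {
    "500m": {"C02": "array-C02"},
    "1km": {"C01": "array-C01"},
    "2km": {"C13": "array-C13"},
}

by_query = query(flat, resolution=1000)  # partial-key lookup on the flat dict
by_path = tree["1km"]["C01"]             # plain path lookup on the tree
```

In the flat case the matching logic lives in the container code; in the tree case the structure itself carries the resolution, which is the part a DataTree could take over.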

This issue is meant to be a discussion of the various options for where DataTree could be used (on a high-level, not reader/writer internals), whether they are a good idea, and what priority should be given to implementing them. Feel free to comment with other ideas or feedback about the options described below.

The fact that this stuff is being discussed does NOT mean that we're breaking backwards compatibility in Satpy any time soon.

Integration Options

  1. Scene.to_xarray_datatree: Similar to the existing Scene.to_xarray_dataset, we could add a to_xarray_datatree method that converts a Scene's collection of DataArrays to a DataTree object. There are a couple different ways that a DataTree could be organized so this method may need to take keyword arguments to control the grouping of the DataArrays. See next section.
  2. Replace Scene internal containers: Basically replace Scene._datasets: DatasetDict with a DataTree. This should have little to no effect on users as this should be an internal data structure. The question is whether or not this benefits the Scene (and us, the developers). Again this depends on what type of grouping scheme is used. See next section. The other question here is what role the existing DataID/DataQuery would play with a new internal structure like this.
  3. satpy.open_mfdatatree: See this datatree issue for a short discussion on the initial idea of an open_mfdatatree function. In the past some Satpy developers have discussed redesigning the Scene and readers to be more separated. For example, a DataSource that provides data to one or more Reader objects, and these Readers are passed to some type of load functionality that reads/loads/generates the proper DataArrays from files and composites if requested. Along these lines it would be nice to provide a generic open_mfdatatree function that takes a set of files and a reader name and perhaps optionally a set of products to load/create and then a DataTree is returned. Combined with many of the other discussions going on all over the place regarding xarray accessors, or splitting other parts of the Scene up, this would abstract away a lot of the complexity of readers and composites and everything else...but it would also take a lot of control away from Satpy users used to the Scene and reader_kwargs and available_dataset_ids.

Grouping Schemes

  1. By geographic area: There could be a branch/group in the tree for each area/swath. So for ABI you might have a 500m resolution group, a 1km group, and a 2km group. Similarly for VIIRS you might have a 375m (I band) resolution group and a 750m (M band) resolution group (I think that's what those resolutions are). Keep in mind this is not just resolution, but the actual area/swath, so if you had a DataTree with two instruments with the same resolutions but different projections there would be a group for each instrument-resolution combination. Theoretically this type of structure makes things like resampling easier as we end up resampling individual Datasets (the DataTree groups) instead of having to sort DataArrays by area/swath.
  2. By DataID parameters: A group for resolution, then inside that a group for calibration, then inside that a Dataset with names for each variable/DataArray. The main benefit here is that this essentially removes the need for DataIDs as-is.
  3. Both grouping scheme (GS) 1 and 2?
  4. By DataID or a serialized version of DataID: I think this still produces a flat structure like we have now, with little if any benefits...unless maybe it was combined with GS 1.
  5. Other?
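
As a rough illustration of grouping schemes 1 and 2, the sketch below sorts some per-array metadata into path-keyed groups using only the standard library. Everything in it is an assumption made for illustration: the metadata dicts, the area names, and the helper names `group_by_area` and `group_by_id_params` do not exist in Satpy.

```python
from collections import defaultdict

# Hypothetical per-array metadata. The area names are made up.
arrays = [
    {"name": "I01", "area": "viirs_i_swath", "resolution": 375,
     "calibration": "reflectance"},
    {"name": "I04", "area": "viirs_i_swath", "resolution": 375,
     "calibration": "brightness_temperature"},
    {"name": "M05", "area": "viirs_m_swath", "resolution": 750,
     "calibration": "reflectance"},
]


def group_by_area(items):
    """Grouping scheme 1: one group per area/swath."""
    groups = defaultdict(list)
    for item in items:
        groups["/" + item["area"]].append(item["name"])
    return dict(groups)


def group_by_id_params(items, params=("resolution", "calibration")):
    """Grouping scheme 2: nest groups by DataID-like parameters."""
    groups = defaultdict(list)
    for item in items:
        path = "/" + "/".join(str(item[p]) for p in params)
        groups[path].append(item["name"])
    return dict(groups)


by_area = group_by_area(arrays)
by_params = group_by_id_params(arrays)
```

Both helpers return a mapping from group path to member names. If each list of names were replaced by an xarray Dataset holding those arrays, the mapping would have the path-to-Dataset shape that `DataTree.from_dict` accepts, so a `to_xarray_datatree` keyword argument could plausibly just choose which helper to apply.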

This is all I can think of for now. Let me know what you think.

@djhoese djhoese added enhancement code enhancements, features, improvements question documentation component:scene future ideas Wishes and ideas for the future labels Jan 13, 2023
@mraspaud
Member

Very good write up, thanks a lot! I'll try to comment on this soon.

@mraspaud
Member

Regarding the integration options, I would start with the readers, so option 3. Beyond implementing open_mfdatatree, that would also allow developers to free themselves from the file handler system, which can be a bit too constrained for datasets with multiple interdependent files.

I think that would make for a clearer api boundary between the reader and the scene, with the reader returning a single datatree with all the data. In my mind, this single datatree would contain a lazy representation of all the contents of the read file(s).

@djhoese
Member Author

djhoese commented Jan 25, 2023

I think that would make for a clearer api boundary between the reader and the scene

I'm not sure I see how this transition happens. In my original option 3 I was thinking of a new top-level interface to "play around with" that in one way or another returns a DataTree object. Very likely the first implementation would use the Scene and something like a to_xarray_datatree call on that Scene.

You are talking about having readers return a DataTree. I think I would consider this a new integration option 4, but it would likely get implemented near (alongside) option 2, where the Scene would use that DataTree or a merge of multiple readers' DataTrees as its internal data container.

In my mind, this single datatree would contain a lazy representation of all the contents of the read file(s).

I think we talked about this in the last meeting and that this isn't currently possible due to performance. If we want to move to this in the future (which I'm OK with) then keeping it in mind as we design something that is do-able now is important.

The hard part about doing anything with readers is we have to do it one reader at a time and support both DataTree and DataArray readers in the Scene handling, but with no changes to the user-side of things. In my mind this DataTree integration should be user-facing first and as a way to advertise Satpy to more users and more use cases. I'm not sure changing the internals/design of readers as a first step gets us much unless it is in direct support of one of the other user-facing options.

@pnuu
Member

pnuu commented Jan 25, 2023

In my mind, this single datatree would contain a lazy representation of all the contents of the read file(s).

I think we talked about this in the last meeting and that this isn't currently possible due to performance.

Only if we use open_mfdatatree directly. If we use other lower level file interfaces and fill in the DataTree ourselves, the reader code can restrict the access to only the necessary parts. Or if it doesn't kill the performance, use the open_mfdatatree directly. But certainly not for FCI 😅

We wouldn't be able to use open_mfdatatree directly to open SEVIRI HRIT/Native, AVHRR AAPP/EPS, and a few other formats in any case.

@djhoese
Member Author

djhoese commented Jan 25, 2023

I think what Martin is saying @pnuu is that the Scene would essentially do:

data_tree = load_data_from_reader(reader_name, filenames=filenames)

And data_tree would include everything from the files. So the reader is still handling the opening of the files and calibration and all that stuff. The Scene, however, doesn't care about DataIDs or available datasets, it just gets a huge DataTree back. Similar to how you would expect xarray's open_dataset to give you everything back. The Scene would then be responsible for pulling out the necessary information from that tree when the user asked for it (I guess).
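
One way to picture "the reader returns everything, the Scene pulls out what it needs" is a tree whose leaves are only evaluated on access. The sketch below is a toy using only the standard library. `load_data_from_reader` reuses the name from the line above but is an invented stand-in, `pull` is hypothetical, and the paths, file name and values are made up. In practice the laziness would more likely come from dask-backed arrays than from closures.

```python
loaded = []  # records which variables were actually read


def _lazy(name):
    """Return a zero-argument loader that reads `name` only when called."""
    def load():
        loaded.append(name)
        return f"data for {name}"
    return load


def load_data_from_reader(reader_name, filenames):
    """Toy stand-in: describe every variable in the files without reading any."""
    return {
        "/1km/C01": _lazy("C01"),
        "/500m/C02": _lazy("C02"),
        "/2km/C13": _lazy("C13"),
    }


def pull(data_tree, path):
    """Scene-side access: evaluate only the requested leaf."""
    return data_tree[path]()


data_tree = load_data_from_reader("abi_l1b", filenames=["file1.nc"])
c13 = pull(data_tree, "/2km/C13")
```

After the call, `loaded` contains only `"C13"`: the tree advertises all three variables, but only the requested one was read.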

@mraspaud
Member

Ok, so what I meant is actually a bit of both.
Obviously, using open_mfdatatree won't work for binary formats, but even for NetCDF data we will want a rather similar internal representation of the data in the datatrees. So we will need to implement new engines.
With that, we will be able to decide what to include or not, and most importantly maybe implement optimisations for filling the datatree. The objective would be to have most of the info provided by the files in the datatree, but we could skip things that we feel will not be needed by users.
Another solution would be to use the concept of groups to filter out things of no interest, e.g.
myreader.open_mfdatatree(modis_files, group="1000m_data") (that brings us back to the grouping scheme though)
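
A group keyword like this could amount to keeping a single subtree. Here is a minimal sketch on a path-keyed dict, standard library only. `select_group` is a hypothetical helper, and every path other than the `1000m_data` group name is invented for illustration.

```python
def select_group(tree, group):
    """Keep only the nodes at or below `/<group>` in a path-keyed mapping."""
    prefix = "/" + group.strip("/")
    return {
        path: node
        for path, node in tree.items()
        if path == prefix or path.startswith(prefix + "/")
    }


modis_tree = {
    "/1000m_data": "group",
    "/1000m_data/band_31": "array",
    "/250m_data/band_1": "array",
    "/1000m_data_extra/x": "array",
}

subset = select_group(modis_tree, "1000m_data")
```

Matching on `prefix + "/"` rather than on the bare prefix keeps a sibling group such as `/1000m_data_extra` out of the result.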

@djhoese
Member Author

djhoese commented Jan 26, 2023

Is the DataTree returned by these engines/readers in a Satpy scheme or the file's scheme? I mean, Satpy readers currently rename a lot of things like variable names or dimensions.

@mraspaud
Member

I was planning on a Satpy scheme

@mraspaud
Member

Just for reference, I can confirm that xarray-datatree is at the moment in the process of being included in the xarray repo.
