Allow cubes/coords/etc to share data #3172

DPeterK · 2018-09-17T09:37:38Z

In some cases it is highly advantageous for cubes to be able to share a data object, which Iris currently cannot handle. Supporting this would mean that Iris could, in some cases, produce views of data rather than copies.

Here's @pp-mo's take on this topic:

IMHO we should aim to be "like numpy".
In this context, that means in the worst cases (e.g. indexing) :

"Result is usually a view, but in some cases a copy."
"it's too complicated to explain exactly when."
"it might change in future releases"
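For reference, the NumPy behaviour alluded to here can be seen in a minimal sketch. It uses plain NumPy arrays only (no Iris objects), and `np.shares_memory` to check whether a result is a view: basic slicing gives a view, while indexing with a list of integers gives a copy.

```python
import numpy as np

base = np.arange(6)

sliced = base[1:4]        # basic slicing: a view onto base
picked = base[[1, 2, 3]]  # integer-list ("fancy") indexing: a copy

slice_is_view = np.shares_memory(base, sliced)
fancy_is_view = np.shares_memory(base, picked)

sliced[0] = 100           # writes through to base[1]
picked[1] = -1            # leaves base untouched
```

Whether an equivalent Iris indexing operation would return a view or a copy is exactly the open design question here; the sketch only shows the NumPy precedent.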

There is some prior work on this topic, including #2261, #2549, #2584, #2681 and #2691. These reflect the importance of this topic. However, given the potential for unexpected behaviour that this change will bring, further thought is still required.

pp-mo · 2018-09-17T14:52:58Z

unexpected behaviour

Some key points from my prior thought on this ...

the key practicality + API design question is "in what context may an Iris operation produce a result which shares data with another Iris object".
the key goal is to control it so it only happens when you expect it or asked for it.
lazy content could confuse this : when does it get evaluated, can that encapsulate a behaviour switch from when it was created (e.g.) ?
the biggy IMHO : Once it is anyway possible to have (e.g.) 2 cubes which share some data, then any operation which can modify its inputs might produce different results. You just can't logically avoid that. Even something as simple as "a = a + b" is potentially affected.
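The last point can be illustrated with a small sketch, again using plain NumPy arrays as stand-ins for cube data. The function `double_then_add` is a hypothetical example, not an Iris operation: it modifies one input in place, so its result depends on whether the two inputs share memory.

```python
import numpy as np

def double_then_add(a, b):
    """Modify `a` in place, then combine it with `b`."""
    a *= 2
    return a + b

# Independent arrays: b is unaffected by the in-place change to a.
a = np.array([1.0, 2.0, 3.0])
b = a.copy()
independent_result = double_then_add(a, b)

# Shared data: b is a view of a, so doubling a also doubles b.
a = np.array([1.0, 2.0, 3.0])
b = a[:]
shared_result = double_then_add(a, b)
```

The same inputs by value give `[3, 6, 9]` in the independent case and `[4, 8, 12]` in the shared case, which is the kind of surprise the API design would need to control.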

pp-mo · 2022-02-01T09:44:33Z

Iris 3.2 and the unstructured data model

Since v3.2 / unstructured, we do finally get cubes which share some components : that is, any cube.mesh

Summary of some relevant facts about new datamodel objects

Basic relevant facts + ideas

cubes loaded from the same file can share a mesh, even if they map different locations
copying a cube with a Mesh results in a cube with the same mesh
- because of the way that MeshCoords copy
- -- we don't actually provide a Mesh.copy() anyway. Since it is not modifiable, it is not clear why you would need one
slicing+indexing a cube currently loses the Mesh, instead of linking/copying
- because MeshCoords won't slice, and cube indexing converts a Coord that fails to slice into an AuxCoord
- however, in future, it may well make sense to have sub-indexing create a cube with a location-index-set
  - (see : What remains for complete UGRID support #4438)
if we do implement location-index-sets, from the Cube perspective they would simply be equivalent to a mesh
- so, they would logically be shareable in the same way as a Mesh

Mesh

does not support copy : we expect multiple things that use it to cross-refer
is mapped to only one Cube data dimension, only via a MeshCoord, and therefore not :
- a cube component (like Coord/Ancil/CellMeasure)
- a _DimensionalMetadata subclass
- indexable as part of sub-indexing a cube

MeshCoords

Are a sort of "convenience" component ..

they "just" represent a relationship between a cube (and its dims) and a Mesh
they are AuxCoords, but don't represent anything in a CF dataset
- thus, they have standard/long/varname and units/attributes ..
- .. but these are basically non-functional, don't "mean" anything, aren't used for anything
- so, there is clearly an argument for these to not be AuxCoords but some distinct, more limited class : the current arrangement is pragmatic (as for Connectivity being a _DimensionalMetadata -- see below).
they are not shared between cubes (but in future could be, if any Coords are ?)
they support copying, and are copied on cube copy
they do not support sub-indexing ..
- .. but are replaced with ordinary AuxCoords on cube indexing (see above)

Mesh location coordinates and Connectivities

are not attached to a cube, or its dims, but only to the Mesh
therefore, implicitly, shared + not copied (between cubes of the same mesh)
so, like a Mesh, they aren't a cube component ..
.. but they are dimensional, and mapped to a Mesh dimension
unlike MeshCoords, they do represent objects in a CF dataset
- so they do have meaningful standard/long/var-name + units + attributes
Mesh location coordinates : are just ordinary AuxCoords (for now at least)
Mesh Connectivities : at present are a subclass of _DimensionalMetadata
- but this is not logical, really just a convenience / anomaly and could reasonably change
- so .. they are in principle indexable and copyable, but this is not really useful or used anywhere at present

Sharing of dimensional components (potentially big arrays)

This is a relevant issue, simply because unstructured data comes with a lot of associated mesh information : large coordinate + connectivity arrays
Typically, much larger than structured equivalents for the same size of data

Mesh Coordinates and Connectivities are effectively shared between cubes, since they belong to the Mesh, which is itself shared.
-- though, identical meshes loaded from different files cannot currently be identified and shared

Any related AuxCoord/CellMeasure/Ancil on the unstructured dimension cannot be shared
They can be lazy, of course, but each Cube will have its own copy

like regular (structured data) Coords
unlike the Mesh coords + connectivities
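The contrast described above can be sketched with a toy model. `ToyMesh` and `ToyCube` are hypothetical stand-ins invented for this illustration and are not Iris classes: the mesh is held by reference, so its large arrays exist once, while each cube carries its own copy of an aux-coord-like array.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ToyMesh:
    # Stand-in for the large mesh coordinate / connectivity arrays.
    node_x: np.ndarray

@dataclass
class ToyCube:
    data: np.ndarray
    mesh: ToyMesh           # a reference: many cubes may point at one mesh
    aux_points: np.ndarray  # per-cube array on the unstructured dimension

n = 1000
mesh = ToyMesh(node_x=np.linspace(0.0, 1.0, n))
aux = np.arange(n, dtype=float)

# Both cubes refer to the same mesh, but each gets its own aux array.
cube_a = ToyCube(np.zeros(n), mesh, aux.copy())
cube_b = ToyCube(np.ones(n), mesh, aux.copy())

mesh_shared = cube_a.mesh is cube_b.mesh
aux_shared = np.shares_memory(cube_a.aux_points, cube_b.aux_points)
```

In this model the mesh arrays are stored once however many cubes use them, whereas the aux arrays are duplicated per cube.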

pp-mo · 2023-09-15T16:30:41Z

Discussed briefly offline with @hdyson, since he and IIRC @cpelley were the original users most concerned about the inefficiency of this.

His recollection of "the problem" to be addressed was somewhat different ...
He thinks it was in the context of combining multiple results into a single array to then be saved, rather than to do with sharing of components in loaded data.

The thing is, sharing of partial data arrays by multiple cubes is already possible.
For example:

>>> import numpy as np
>>> from iris.cube import Cube
>>> data = np.zeros((10,))
>>> c1, c2, c99 = Cube(data[:5]), Cube(data[5:]), Cube(data[4:8])
>>> c1.data[3] = 7
>>> c2.data[:4] = 99
>>> c99.data[:] = 50
>>> data
array([ 0.,  0.,  0.,  7., 50., 50., 50., 50., 99.,  0.])
>>> c1.data
array([ 0.,  0.,  0.,  7., 50.])
>>> c2.data
array([50., 50., 50., 99.,  0.])
>>> c99.data
array([50., 50., 50., 50.])
>>>

pp-mo · 2023-09-15T17:06:10Z

In the course of the above discussion, I rather revised my thoughts.

My understanding is that the major opportunity for inefficiency is where multiple cubes contain identical components, such as aux-coords, ancillary-variables or cell measures.
It doesn't really apply to cube data, since we don't generally expect cube data to be linked.

If all those cube-components' data may be realised, then there is an obvious inefficiency.
(e.g. there was a period when saving cubes realised all aux-coords -- though that is now fixed.)
If these contain real data, then this could easily be shared, as the above cube data examples show.
However, normally, when loaded from file, these components would contain multiple lazy arrays, referencing the same data in the file.

So, in the lazy case, it is quite possible that some cube operations might load all that data, or at least transiently fetch it multiple times (e.g. within computation of a lazy result, or a save).
I think there is no clean way to "link" the separate lazy arrays, but it should be possible for the cubes to share either the cube components themselves -- i.e. the objects, such as aux-coords -- or, within those, their DataManagers. Effectively, this is already happening with Meshes.
With that provision, realising the components would "cache" the data and not re-read it (still less allocate additional array space). However, that in itself would still not improve lazy operations -- including lazy streaming during netcdf writes -- since dask does not cache results, and the lazy content would still be re-fetched multiple times.
To address that, it would be possible to implement a caching feature within NetCDFDataProxy objects, but that approach is not very controllable -- and could itself cause problems, if the total data size of a single object is large (in which case, storing only one chunk at a time may be highly desirable).
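A rough sketch of the caching idea, under stated assumptions: `CountingProxy` and `CachingProxy` are hypothetical classes written for this illustration, and are not the Iris NetCDFDataProxy or its API. The first simulates a file-backed array that records every read; the second wraps it and keeps the whole array in memory after the first access.

```python
import numpy as np

class CountingProxy:
    """Hypothetical file-backed array: counts how often it is read."""

    def __init__(self, values):
        self._values = np.asarray(values)
        self.reads = 0

    def __getitem__(self, keys):
        self.reads += 1
        return self._values[keys]

class CachingProxy:
    """Hypothetical wrapper: reads the whole array once, then serves from memory."""

    def __init__(self, proxy):
        self._proxy = proxy
        self._cache = None

    def __getitem__(self, keys):
        if self._cache is None:
            self._cache = self._proxy[...]  # single full read of the backing store
        return self._cache[keys]

# Without caching, each fetch is a separate read.
plain = CountingProxy(np.arange(8.0))
_ = plain[:4], plain[4:]

# With caching, repeated fetches cost one read in total.
backing = CountingProxy(np.arange(8.0))
cached = CachingProxy(backing)
first_half, second_half = cached[:4], cached[4:]
```

The drawback noted above is visible in the sketch: the cache holds the entire array, with no control over when it is released, which is a poor fit for objects too large to keep in memory at once.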

In short, we may need to focus more carefully on what the common problem cases actually are, since I think there has been some confusion here in the past, and all the solutions so far proposed may have potential drawbacks.

DPeterK added Experience: High Status: Decision Required labels Sep 17, 2018

DPeterK mentioned this issue Sep 17, 2018

Sharedata #2691

Closed

rcomer mentioned this issue Jan 20, 2021

Changing values in iris cube based on coordinates instead of index #3948

Closed

pp-mo mentioned this issue Apr 30, 2022

Support lazy saving #4190

Closed

pp-mo mentioned this issue Nov 1, 2022

Accept new copy behaviour from dask/dask#9555. #5041

Merged

trexfeathers mentioned this issue Nov 7, 2022

What makes the NAME loader faster than the NetCDF loader? #5053

Closed

trexfeathers added this to 🚴 Peloton Jun 23, 2023

trexfeathers added the Dragon 🐉 https://github.com/orgs/SciTools/projects/19?pane=info label Jul 10, 2023

trexfeathers added this to 🐉 Dragon Taming Jul 10, 2023

trexfeathers changed the title ~~Allow cubes to share data~~ Allow cubes/coords/etc to share data Sep 15, 2023

trexfeathers moved this to 📌 Prioritised in 🐉 Dragon Taming Sep 15, 2023

scitools-ci bot removed this from 🚴 Peloton Dec 15, 2023

scitools-ci bot added this to 🚴 Peloton Dec 15, 2023

pp-mo mentioned this issue May 30, 2024

Implement copying of meshes #5982

Open

trexfeathers mentioned this issue Nov 8, 2024

Loading a netCDF file with multiple variables is very slow #6223

Open
