Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NetCDF global attributes vs data variable local attributes #3325

Closed
bjlittle opened this issue Jun 5, 2019 · 32 comments · Fixed by #5152
Closed

NetCDF global attributes vs data variable local attributes #3325

bjlittle opened this issue Jun 5, 2019 · 32 comments · Fixed by #5152

Comments

@bjlittle
Copy link
Member

bjlittle commented Jun 5, 2019

Update Mon 3rd July:
this effort is now handled in it's own project
please see there for existing task breakdown + progress


At the moment iris takes a rational but naive approach to dealing with the local attributes of a NetCDF variable (a variable that becomes a cube) and the global attributes of the NetCDF file that the said variable comes from.

That is, the resultant cube.attributes will be a combination of both the local and global attributes, where the local attributes will take precedence, and overwrite, common global attributes.

From the inception of iris, and in the light of no use cases, this seemed like a reasonable thing to do. However, such an approach prevents preservation of the local and global attributes metadata. This is a major issue for many users, who require to preserve all attribute metadata.

We require to resolve this issue now in iris once and for all 😄

Note that, if a solution to this issue was implemented, then it would most likely be a breaking change - caution is needed here.

This is somewhat tangential related to #2352

@bjlittle
Copy link
Member Author

bjlittle commented Jun 5, 2019

Ping @zklaus 👍

@zklaus
Copy link

zklaus commented Jun 5, 2019

A non-breaking way to implement that may be to introduce two new class members global_attributes and local_attributes (or var_attributes) responsible for the respective attributes with a property replacing the current attributes that can mimic the current behaviour for reading, overwrite existing entries in the respective dictionary, and otherwise defaulting to the local one.

@bjlittle
Copy link
Member Author

bjlittle commented Jun 6, 2019

@zklaus I was thinking along the same lines. In summary:

  • cube.local_attributes is a mutable dictionary for defining the attributes that are local to the associated NetCDF data variable of the cube when it's written to a NetCDF file
  • cube.global_attributes is a mutable dictionary for defining the attributes that are global to the associated NetCDF file that the data variable of the cube is written to
  • cube.attributes is a (stateless) combination of the cube.global_attributes and cube.local_attributes, akin to now, where the cube.local_attributes have priority over the cube.global_attributes

So for the case where there is a common attribute shared between cube.global_attributes and cube.local_attributes, the local value is shown in the cube.attributes. When saving such a cube to NetCDF, then the common local and global attribute is preserved (hoo-rah) i.e. the local on the NetCDF data variable of the cube, and the global in the global scope of the NetCDF file.

To ensure that there is a non-breaking behaviour here, then I think that I'm right in saying that if a user writes to the cube.attributes then this state is captured in the cube.global_attributes. Note that, cube.attributes is stateless, in the sense that it is simply derived on the fly at run-time from both cube.local_attributes and cube.global_attributes, with local having priority over global. This means that if a user wants to associate an attribute to the NetCDF data variable of the cube, then they must explicitly add the attribute to the cube.local_attributes, and not the cube.attributes.

Hmmm.... this make sense to me. Thoughts?

@bjlittle bjlittle assigned bjlittle and unassigned bjlittle Jun 6, 2019
@zklaus
Copy link

zklaus commented Jun 11, 2019

Exactly what I was thinking!

@jonseddon
Copy link
Contributor

We bumped into this ticket while looking at a related project. Can I check what would happen when you save a cubelist rather than a cube? If cubes in the cubelist had different values of the same global attribute, how would this be saved? If this saved netCDF file was loaded back into Iris would we get the same cubelist and if we didn't, would this matter?

@bjlittle
Copy link
Member Author

bjlittle commented Aug 1, 2019

@jonseddon Good question...

Clearly, if there is a global attribute conflict across Cubes in the CubeList, then it wouldn't be possible to save any such conflicted attribute to the netCDF file.

However, it begs the question whether a CubeList should also have state for overriding attributes on save. Again, careful consideration is required here to understand the overall behaviour and whether that's appropriate. In particular, consideration is required also for loading from multiple different netCDF files...

To be honest, I'd opt to separate concerns here. I'd see the debate about CubeList having attributes state as an extension to this proposal - but there should certainly be clarity for the behaviour when there is a conflict for global attributes of Cubes in a CubeList as it stands here.

@ehogan
Copy link
Contributor

ehogan commented Aug 1, 2019

I had a question about this :) Does it make sense to add global attributes and variable attributes, which I would argue are netCDF-specific concepts, to the cube, which is meant to be format-agnostic?

@zklaus
Copy link

zklaus commented Aug 1, 2019

Re cubelists: The question is certainly a good one. Note that right now there is no guarantee that loading a single cube from a netcdf file and saving it again will give you the same file. For example,

  • if there is a global attribute comment and a local attribute comment, the local attribute will essentially overwrite the global one and end up as the only comment attribute in the final file in the global section
  • a global attribute that is one of the special local attributes in iris will end up in the variable section
  • a local attribute that is not recognized as special will end up in the global section.

Seeing as it seems to me that a lot (most?) data has one variable per file (notably of course all cmip and cordex data) I am not sure I would be worried about consistency in cubelist storing so much, at least until we have better consistency in cube storing. Though it is certainly a good idea to keep this in mind so as not to make unnecessary outright contradictory decisions.

Re format agnosticism: That is certainly a nice goal, but maybe it needs better definition? It seems that attributes in and of themselves are unsupportable in, eg grib and derived formats. Surely we don't want to abandon them completely. So, is there a format that has attribute support, is supported by iris, and could not be made to work with this model?

@bjlittle
Copy link
Member Author

bjlittle commented Aug 1, 2019

@ehogan From a purely idealistic perspective, I'd agree with you. A Cube should be format agnostic. However, in reality, that's not really the case.

For me it's an intention at best, rather than a hard and fast rule. Consider the special way that we handle PP STASH and NetCDF var_name. These fileformat specifics have crept into the way that we deal with cubes and coordinates, along with other CF-isms that may only make sense for NetCDF. The reason this has happened is that there is tangible utility or benefit behind it - so it's a common sense compromise in my opinion. However, we do try hard not to dilute our attempts to be as agnostic as possible.

I don't know if this helps answer your question...

@pp-mo
Copy link
Member

pp-mo commented Aug 2, 2019

Consider the special way that we handle PP STASH and NetCDF var_name

Actually we should have something in GRIB space too, but we don't.
Just plugged a suggestion here : SciTools/iris-grib#153

@bjlittle bjlittle modified the milestones: v2.3.0, v3.1.0 Nov 13, 2019
@pp-mo
Copy link
Member

pp-mo commented May 23, 2022

my attempt at describing what it currently does

Sorry for delay, the code is more baroque than I knew !
Here's my best account
( but may still be wrong in places : maybe worth testing some of these ideas, rather than trying to deduce truth from code )

The code divides possible attributes into various different classes of interest

  1. any "Conventions" attribute
  2. those "specially handled", listed in '_CF_ATTRS'
  3. those "variable-only", listed in '_CF_DATA_ATTRS' and '_UKMO_DATA_ATTRS'
  4. those "global-only", listed in '_CF_GLOBAL_ATTRS'
  5. those which are the same in all cubes saved at one time
  6. everything else (attributes not on all cubes, or not the same on all cubes)

(1) Class 1 includes either "conventions" or "Conventions" . These are always ignored; we never write a "local" conventions attribute; we always write a global Conventions = "CF-1.7".

(2) Class 2 have interpreted CF/netcdf meaning handled specially by the loader, e.g. "_FillValue", "valid_min", "standard_name".
All of these can never appear as attributes of loaded cubes.
We also prohibit setting them via a Cube.attributes being a LimitedAttributeDict. Unfortunately the 'forbidden' list here is not exactly the same, which in some cases clearly makes sense but I fear this scheme could have some holes in it.
So, these will get written as data variable attributes as required by save rules, but not from being in an attributes dict (e.g. "standard_name", "_FillValue")

(3) Class 3 is those which only make sense as local attributes applying to a particular data-variable, e.g. "flag_meanings"
It does not include things like "valid_min", "_FillValue" : those are in class 2 (broadly?)
So, these are always local

(4) class 4 are handled like any other non-local-only (non-class 3) attribute, except that if they get written as local attributes we also raise a warning about it.
So, these are expected to always be global, but will be created as local attributes if required.

(5) attributes which are not in class 2/3, and appear on all saved cubes.
--> Saved as file global attributes.

(6) all others
--> Saved as local attributes, i.e. as attributes of the (cube) data variable, instead of file global attributes

NOTE if there is only 1 cube, class (6) does not really happen, since (5) dominates, except for the specific cases in classes 1/2/3 : This is the same as all cubes having the same attribute.
So the general principle is that everything that can be global will be.

@pp-mo
Copy link
Member

pp-mo commented May 24, 2022

IMHO, what comes out of this ...
My main take-homes are :

  1. loading will naturally assign attributes into the new global/local categories
  2. when reproducing the existing save behaviour, it certainly isn't possible to follow those initial assignments exactly
  3. .. but ... even in the new world "non-legacy mode", possible collisions between attributes recorded as "global" can still reasonably occur
    • either by combining results loaded from different sources, or from other actions like creating cubes from scratch
    • In such cases, we should revert to storing them as "local", with an appropriate warning
  4. We probably also need to retain the existing notions that some specific attributes are expected to be global-only, and some are local-only.
    • probably still worth warning about unexpected usages
    • but also, there may be scope for some changes here to be treated as "bug fixes"
    • In this context, we should definitely also consider @larsbarring comment above, that having the same attribute recorded as both local and global, can be useful, and is seen in practice -- despite possibly being contra to the intentions of CF.
      possibly even for recognised attributes like "history".
      That could also suggest possible changes to the existing behaviours
  5. when assigning any-old generic attributes via the legacy interface (e.g. cube.attributes['odd'] = 1), I guess that 'local' should be the default, that seems sensible. However, this will be at odds with existing behaviour, at least for a single cube. I guess in the new world, we would expect global attributes to be the exception -- you have to ask for that explicitly. Which at least gives less opportunity for conflicts.

@larsbarring
Copy link
Contributor

larsbarring commented May 24, 2022

Following up on what @pp-mo just wrote

... having the same attribute recorded as both local and global, can be useful, and is seen in practice -- despite possibly being contra to the intentions of CF.

Some time ago I had a conversation with some CF persons regarding what was/is actually meant with the sentence I was citing

When an attribute appears both globally and as a variable attribute, the variable’s version has precedence.

Whether it is the actual existence of same attribute [name] at both places, or the content of the of the two attributes being contradictory. While we did not came to a definitive conclusion we agreed that it was not the intention to only consider the attribute names and totally disregard their content.

This would not be the only place where CF is not crystal clear regarding the distinction between "name" and "content" or "spelling" and "meaning". I think that it is useful to think back what was the common situation about 20 years ago (or more), when the first version CF was produced. Then the common practice was that analysts manually looked at files and their content (ncdump style) before doing analyses. Then it was just obvious what to do and how to interpret the data, e.g. that standard and gregorian calendars were the same, and that `proleptic_gregorian`` also is the same for recent centuries, and that an attribute existing both as local and global may either contradict or complement each other (or even both) depending on the content, and how to make use of that information. I believe there are many other places where the same kind of ambiguity may exist in the CF conventions document.

@wjbenfold
Copy link
Contributor

Question to the users in the room: what behaviour do you want from merge / concatenate / other operations that make a new cube at the end? When we would make the choice as to whether (for example) a mismatch of these attributes should block merge, we won't know if they're intended to be used (and if they're not then users shouldn't be blocked on that merge).

We could

  • require a context manager around the load step to cause these attributes to be captured, then have them merge/concatenate in a manner akin to normal attributes.
  • automatically capture these attributes, but then drop them as soon as a new cube is being made, and leave it to the user to add them in to the resultant cube.
  • drop these attributes during merge etc. unless there's a context manager set up around the merge.
  • come up with more options.

There's also the option (which is simpler, and therefore more quickly implemented, but correspondingly less useful) of setting up a way that the user can access the full local/global attribute information at load time, hold it themselves and then specify its application to the resultant cubes at save time. Would that appeal?

@larsbarring
Copy link
Contributor

As one "user in the room" I venture some some thoughts:

  1. It seems useful to keep a "legacy mode" as an alternative to a "new mode" (or maybe even several alternative new modes, like simple/advanced...)
  2. Roundtrip consistency: loading a simple cube and then saving it should give the same file and then the same cube when again loading from the new file (cf. @zklaus's comment). The same for cube lists based on one file.
  3. If I try to come up with a general (and ugly) use case. Assume that we have three files each containing the same two variables, that eventually would be merged into two cubes, or a cubelist of two cubes:

File 1 (covering time period 1):

        my_var1 // has no comment attribute
        my_var1: history = "CMOR changed units from X to Y"
        my_var2: comment = "this data is very good"
        my_var2: history = "CMOR changed units from Z to W"
 // global attributes:
                 :comment = "generally all this data is good"
                 :history = "CMOR made the coordinate system information conformant with ZMIP requirements"

File 2 (covering time period 2):

        my_var1: comment = "this data is good"
        my_var1: history = "CMOR changed units from X to Y"
        my_var2 // has no comment attribute
        my_var2: history = "CMOR changed units from Z to W"
 // global attributes:
                 :comment =  "generally data for this period is not so good"
                 :history = "CMOR made the coordinate system information conformant with ZMIP requirements"

File 3 (covering time period 3):

        my_var1: comment = "this data is good"
        my_var1: history = "CMOR changed units from X to Y"
        my_var2: comment = "this data is good"
        my_var2: history = "CMOR changed units from Z to W"
 // global attributes:
                 :comment = "generally all this data is good"
                 :history = "CMOR made the coordinate system information conformant with ZMIP requirements"

That is, someone first made an overall assessment of data quality and found that "generally all this data is good". Later someone looked in more detail and found that that some periods were less reliable and furthermore that this varied between variables.

While data quality preferably should be recorded as ancillary quality flags, this example illustrates that it is possible (and perhaps reasonable) to have some kind of complementary (or perhaps hierarchical) global and local attributes. Furthermore it illustrates that an [advanced?] user might want to know the (possibly multidimensional) order in which files were concatenated/merged. I touched on this in a previous comment on another (now closed) issue.

Anyway, these are some thoughts that I realise are maybe not so easy to implement .

@pp-mo
Copy link
Member

pp-mo commented Sep 7, 2022

Update : my view, as-at 2022-09-07

I feel we already reached a point here were we mostly agree on principles, but then dropped the ball in that we (Iris devs) should have produced a proposal.
I'm now aiming to do that soon ...
I'll either get stuck on something requiring further discussion, or I'll produce an actual PR
(which however could well prove contentious!)

I've been prompted too by discussions elsewhere, on iris/xarray inter-conversion, which again turned up some of the known roundtrip problems that Iris has.

@trexfeathers
Copy link
Contributor

The level of outstanding work means this isn't going to make it into Iris v3.4. We will however keep up the momentum over the next few weeks to see this gets finished without going stale.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Status: 💰 Finished
Status: 🏁 Done
Status: 🏁 Done
Status: Done
Development

Successfully merging a pull request may close this issue.