solar data tweak - targeting SOLARIS-HEPPA-CMIP-4-5 #139

Closed
1 of 2 tasks
durack1 opened this issue Oct 22, 2024 · 19 comments · Fixed by #159

Comments

@durack1
Contributor

durack1 commented Oct 22, 2024

Issues to solve:

  • Missing license_id entry in source_id view here
  • ?

Files have a valid license identifier but not the "license_id" attribute that is being lifted to populate webpages, e.g.,

$ ncdump -h ../input4MIPs/CMIP6Plus/CMIP/SOLARIS-HEPPA/SOLARIS-HEPPA-CMIP-4-4/atmos/mon/multiple/gn/v20241018/multiple_input4MIPs_solar_CMIP_SOLARIS-HEPPA-CMIP-4-4_gn_185001-202312.nc | grep license
		:license = "Solar forcing data produced by SOLARIS-HEPPA is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). The data producers and data providers make no warranty, either expressed or implied, including but not limited to, warranties of merchantability and fitness for a particular purpose. All liabilities arising from the supply of the information (including any liability arising in negligence) are excluded to the fullest extent permitted by law." ;

@znichollscr @st-bender @berndfunke ping - just a note for a very trivial clean up in the next version

@st-bender

So, in what way would you prefer that field to be populated?

@znichollscr
Collaborator

znichollscr commented Oct 22, 2024

We just add a global attribute, "license_id", with value "CC BY 4.0", to all files. The current "license" attribute is fine as it is.
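
A minimal sketch of how that attribute could be added, assuming the third-party netCDF4 Python package is available. The helper names (`needs_license_id`, `add_license_id`, `add_license_id_to_all`) are hypothetical and not part of any input4MIPs tooling. The existing "license" attribute is left untouched.

```python
"""Sketch: add a `license_id` global attribute to netCDF files."""
from pathlib import Path

LICENSE_ID = "CC BY 4.0"  # value requested in this issue


def needs_license_id(global_attrs):
    """Return True if the attributes lack the expected `license_id`."""
    return global_attrs.get("license_id") != LICENSE_ID


def add_license_id(path):
    """Write `license_id` into one file, leaving `license` as it is."""
    # Assumption: the third-party netCDF4 package is installed.
    import netCDF4

    # Mode "a" opens the existing file for modification in place.
    with netCDF4.Dataset(str(path), mode="a") as ds:
        attrs = {name: ds.getncattr(name) for name in ds.ncattrs()}
        if needs_license_id(attrs):
            ds.setncattr("license_id", LICENSE_ID)


def add_license_id_to_all(directory):
    """Apply `add_license_id` to every .nc file below `directory`."""
    for path in sorted(Path(directory).rglob("*.nc")):
        add_license_id(path)
```

Because the files are edited in place, it is probably safest to run something like this on a copy of the data first.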

@st-bender

st-bender commented Oct 22, 2024

We just add a global attribute, "license_id", with value "CC BY 4.0", to all files. The current "license" attribute is fine as it is.

We can do that of course.
Btw. is there some official documentation for the global metadata?
Such as required and optional attributes together with their format requirements and best practices?
That would be good to have, instead of going back and forth with GH issue comments, you could point people there.

@znichollscr
Collaborator

We're trying, but it's a work in progress and lots of things to do. The tool that best captures it (in my opinion) is https://github.com/climate-resource/input4mips_validation. However, as you can see, there's still lots of things we're not capturing (specifically climate-resource/input4mips_validation#73, climate-resource/input4mips_validation#76).

Some more details are here: #15. As you can tell, the rules are fuzzy and hard to trace so I would say that the tool linked above is really the most concrete (because it's written in code, not words).

@durack1
Contributor Author

durack1 commented Oct 22, 2024

And sorry @st-bender, this is the list of licenses that we are recommending; pick and choose your flavour (we are only recommending one, but could conceivably deal with CC0 if someone absolutely wanted it):

{
    "CC BY 4.0":{
        "conditions":"The input4MIPs data linked to this entry is licensed under a Creative Commons Attribution 4.0 International (https://creativecommons.org/licenses/by/4.0/). Consult https://pcmdi.llnl.gov/CMIP6/TermsOfUse for terms of use governing CMIP6Plus output, including citation requirements and proper acknowledgment. The data producers and data providers make no warranty, either express or implied, including, but not limited to, warranties of merchantability and fitness for a particular purpose. All liabilities arising from the supply of the information (including any liability arising in negligence) are excluded to the fullest extent permitted by law.",
        "long_name":"Creative Commons Attribution 4.0 International",
        "license_url":"https://creativecommons.org/licenses/by/4.0/"
    }
}
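
To illustrate how such a table might be used, the sketch below looks up a file's license_id in a trimmed-down copy of the entry above. The long "conditions" text is omitted for brevity, and `describe_license` is a hypothetical helper, not part of the CVs or the validator.

```python
import json

# Trimmed copy of the entry above; the long "conditions" text is left out.
LICENSES = json.loads("""
{
    "CC BY 4.0": {
        "long_name": "Creative Commons Attribution 4.0 International",
        "license_url": "https://creativecommons.org/licenses/by/4.0/"
    }
}
""")


def describe_license(license_id):
    """Return (long_name, license_url) for a recommended license_id."""
    try:
        entry = LICENSES[license_id]
    except KeyError:
        known = ", ".join(sorted(LICENSES))
        raise ValueError(
            f"unknown license_id {license_id!r}; expected one of: {known}"
        ) from None
    return entry["long_name"], entry["license_url"]
```

With only this entry in the table, the lookup is an exact string match, so a spelling such as "CC-BY-4.0" would be rejected rather than silently accepted.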

As backstory, all modelling groups in CMIP6 aside from one went with the CC BY 4.0 license, with a single group going CC0 (see here).

@durack1
Contributor Author

durack1 commented Oct 22, 2024

Btw. is there some official documentation for the global metadata?
Such as required and optional attributes together with their format requirements and best practices?
That would be good to have, instead of going back and forth with GH issue comments, you could point people there.

This is a good suggestion, but such documentation is not quite in place yet. The best reference is the CMIP6 guidance document, which can be viewed here. It is not a perfect match (for example, license_id is not something that we had in CMIP6), but the rest is very similar, as are the DRS/directory structure and filenames.

@st-bender

st-bender commented Oct 23, 2024

We're trying, but it's a work in progress and lots of things to do. The tool that best captures it (in my opinion) is https://github.com/climate-resource/input4mips_validation. However, as you can see, there's still lots of things we're not capturing (specifically climate-resource/input4mips_validation#73, climate-resource/input4mips_validation#76).

Some more details are here: #15. As you can tell, the rules are fuzzy and hard to trace so I would say that the tool linked above is really the most concrete (because it's written in code, not words).

IMHO, that feels a bit backwards. I believe the specifications should come first. That would enable you to focus on the important and necessary things and to find missing or unclear things that can be adjusted. Then you can write the validation tool and make sure that it does what it is supposed to do.

It doesn't need to be much text, you could extract the list from your code and put it here, for example.

Edit: Especially since everything is in flux, a simple (draft) list as a basis for discussion seems to be the way to go.

@znichollscr
Collaborator

With all due respect @st-bender, I think you are underestimating the problem a bit. For example, here is the dataset forcing specification from CMIP6 (https://docs.google.com/document/d/1pU9IiJvPJwRvIgVaSDdJ4O0Jeorv_2ekEtted34K9cA/edit?tab=t.0#heading=h.cn9f7982ycw6), it's 7 pages alone and relies hugely on the CF conventions (https://cfconventions.org/), which are 260 pages long in their pdf form (https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.pdf).

The rules have been written down for a long time. The issue is that words are imprecise (so what is written down is often wrong/self-contradictory) and the rules are very hard to learn from their text form (with the result being that very few people learn them). Having it in code is a) much more precise and b) much more re-usable (you don't need to learn all the rules, you just run the validator and it tells you whether anything is wrong).

If you want to help out with this overall process or join the forcings task team, I don't think either Paul or I would complain. As I said though, I think you are underestimating the task we have. You only have 3 files and they are relatively straightforward, so it feels like the solutions should be simpler. However, we're also trying to support datasets with hundreds of files, where really simple solutions stop working (which is why the CF conventions are 260 pages long).

@st-bender

With all due respect @st-bender, I think you are underestimating the problem a bit. For example, here is the dataset forcing specification from CMIP6 (https://docs.google.com/document/d/1pU9IiJvPJwRvIgVaSDdJ4O0Jeorv_2ekEtted34K9cA/edit?tab=t.0#heading=h.cn9f7982ycw6), it's 7 pages alone and relies hugely on the CF conventions (https://cfconventions.org/), which are 260 pages long in their pdf form (https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.pdf).

Maybe I am. I just think that code alone is not a good reference. It might have bugs, and in my experience if it is not very carefully written and well documented, it can be hard to read and understand. And just using it as a black box is not very helpful either.

The rules have been written down for a long time. The issue is that words are imprecise (so what is written down is often wrong/self-contradictory) and the rules are very hard to learn from their text form (with the result being that very few people learn them). Having it in code is a) much more precise and b) much more re-usable (you don't need to learn all the rules, you just run the validator and it tells you whether anything is wrong).

I think we disagree here. Inconsistencies in the specifications can be discussed and rectified if they are critical, and I am not assuming that this is easy. I just think that with code as the only reference those are much harder to catch. Also how does the code decide which version of the imprecise language to use?

But anyway, this is getting offtopic and should be moved somewhere else.

@znichollscr
Collaborator

But anyway, this is getting offtopic and should be moved somewhere else.

I agree. If you want to continue it, please do. A final thought on this while I'm here is below (which may be a useful starting point)

I hear and understand all the points you're making. If you would like to help the task team, it would be great to have another person who cares about this on board and Paul and I would gladly tell you all of our thoughts on what has and hasn't worked previously, why we think it has/hasn't worked and why we're going the route we're going. I think the specifications in the current code are going in a comprehensible direction, but feel free to take a look yourself (e.g. the current tracking_id validation, creation_date validation). If you're not interested, also fine, Paul and I will keep doing our best and appreciate the time you take to fix things as we start to catch things more and more (for what it's worth, your dataset is super clean, others have been through much more painful re-writes e.g. #123).

@durack1
Contributor Author

durack1 commented Oct 23, 2024

@st-bender thanks for engaging, it's useful to discuss this. @znichollscr is right, a challenge for the forcing datasets is that they veer very far away from the single variable per file format of CMIPx. There are reasons that this makes sense, and a lot of legacy in addition (many modelling groups don't want formats to change as they'd have more work to change their post-processing or model codes). This considerably complicates the edge cases that are appearing to interpret and publish.

I agree that having a clearly defined specifications document that works through all the edge cases in an iterative process would be ideal. However, most of the forcing efforts are voluntary, and the ideal is very difficult to realize, in addition to the time pressures that the AR7 and CMIP7 Fast Track impose. If you have feedback on the CMIP6 specifications document (here), please make suggestions; it would be useful to update this to the latest reference alongside @znichollscr's input4MIPs-validator tool, which has been a great self-help tool for the data providers that have picked it up and used it. It's far more objective and thorough than my or @znichollscr's eyes alone!

Again, thanks for the engagement here, and I second the invitation to help us out with this process, the more hands the better the project, and the datasets will be!

@st-bender

@znichollscr
FWIW, it looks like all the info I was talking about here is already in dataset, and my mere suggestion was to put that somewhere more accessible in a list or table, something like:

  • required (from metadata_data_producer_minimum.py):

    • grid_label: str, Label that identifies the file's grid, see CMIP6 spec
    • nominal_resolution: str, Nominal resolution of the data in the file, see CMIP6 spec
    • source_id: str, The ID of the file's source, see CMIP6 spec
    • target_mip: str, The MIP that this file targets, see CMIP6 spec
    • creation_date: str, in isoformat "%Y-%m-%dT%H:%M:%SZ"
    • tracking_id: str, format "hdl:21.14100/<uuid>" with uuid in format "aaaaaaaa-bbbb-4ccc-dddd-eeeeeeeeeeee"
  • additionally required for multiple variable datasets (from metadata_data_producer_multiple_variable_minimum.py, and not in the above):

    • dataset_category: str, The file's category, see CMIP6 spec
    • realm: str, The realm of the data in the file, see CMIP6 spec
  • optional (from metadata.py, and not in the other two):

    • activity_id: str, Activity ID that applies to the file
    • contact: str, Email addresses to contact in case of questions about the file
    • license_id: str, abbreviated license notation, e.g. "CC-BY-4.0" for creative commons attribution 4.0, or "CC0" for ...? (Do you want with dashes, with spaces, or both, or without any: CCBY4?)
    • ...

It looks like there are duplicates in metadata.py so I might have looked in the wrong place, but I hope you get the idea.
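
The two format requirements in the list above (creation_date and tracking_id) can be checked with the standard library alone. This is a rough sketch, not the code used in input4mips_validation; the function names and the lowercase-hex assumption for the UUID are assumptions made for illustration.

```python
import re
from datetime import datetime

CREATION_DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

# "hdl:21.14100/" followed by a version-4 UUID (note the fixed "4").
# Assumption: hex digits are lowercase, as in the example pattern.
TRACKING_ID_RE = re.compile(
    r"hdl:21\.14100/"
    r"[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def is_valid_creation_date(value):
    """Return True if `value` parses with CREATION_DATE_FORMAT."""
    try:
        datetime.strptime(value, CREATION_DATE_FORMAT)
    except ValueError:
        return False
    return True


def is_valid_tracking_id(value):
    """Return True if `value` is 'hdl:21.14100/<uuid4>' and nothing else."""
    return TRACKING_ID_RE.fullmatch(value) is not None
```

Note that `strptime` tolerates some non-zero-padded fields, so a stricter check might pair it with a regular expression as well.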

@durack1
Contributor Author

durack1 commented Oct 23, 2024

@st-bender apologies, we should have pointed you to the CVs (controlled vocabularies) that the input4MIPs-validator is using to validate contributions - see PCMDI/input4MIPs_CVs/CVs

@znichollscr
Collaborator

It looks like there are duplicates in metadata.py so I might have looked in the wrong place, but I hope you get the idea.

Got it, thanks. Further discussion can roll into here if it's needed: climate-resource/input4mips_validation#77

@st-bender

Hi,
We are preparing the files for SOLARIS-HEPPA-CMIP-4-5.
license_id will be added. Are there any more updates or changes to the metadata that we should be aware of?
Cheers.

@durack1
Contributor Author

durack1 commented Nov 21, 2024

@st-bender I just took a quick peek at your latest monthly file (below), and it looks pretty good to me. I note that we had discussed whether "fx" was the right frequency for the piControl climatology data, and I don't have a better suggestion to add right now.

Out of curiosity, what is the fix/issue that 4.5 will be solving over 4.4? We'll need that description to be caught alongside the 4.4 dataset deprecation and replacement by 4.5.

// global attributes:
		:title = "SOLARIS-HEPPA CMIP7 historic solar forcing (1850-2023)" ;
		:institution_id = "SOLARIS-HEPPA" ;
		:institution = "APARC SOLARIS-HEPPA" ;
		:activity_id = "input4MIPs" ;
		:comment = "The NASA NOAA LASP (NNL) solar variability models were formerly known as the Naval Research Laboratory (NRL) solar variability models. NNL V1 models will become the operational NOAA/NCEI Solar Irradiance Climate Data Record (CDR) V3 in August 2024. The SSI and F10.7 data are taken from V03. Sub-annual variability has been added for the period before 1874; TSI in this file is the integral over SSI from source data between 0 and 100,000nm" ;
		:time_coverage_start = "1850-01-01" ;
		:time_coverage_end = "2023-12-31" ;
		:frequency = "mon" ;
		:source = "SSI, TSI, and F10.7 from ssi_v03r00 (Odele Coddington et al., pers. comm.); Ap and Kp from ftp.ngdc.noaa.gov until 2014, afterwards from GFZ Potsdam (https://kp.gfz-potsdam.de)" ;
		:source_id = "SOLARIS-HEPPA-CMIP-4-4" ;
		:realm = "atmos" ;
		:further_info_url = "http://solarisheppa.geomar.de/cmip7" ;
		:metadata_url = "see http://solarisheppa.geomar.de/solarisheppa/sites/default/files/data/cmip7/CMIP7_metadata_description_4.4.pdf" ;
		:contributor_name = "Bernd Funke, Timo Asikainen, Stefan Bender, Odele Coddington, Thierry Dudok de Wit, Illaria Ermolli, Margit Haberreiter, Doug Kinnison, Judith Lean, Sergey Koldoboskiy, Daniel R. Marsh, Hilde Nesse, Annika Seppaelae, Miriam Sinnhuber, Ilya Usoskin, Max van de Kamp, Pekka T. Verronen" ;
		:references = "Funke et al., Geosci. Model Dev., 17, 1217--1227, https://doi.org/10.5194/gmd-17-1217-2024, 2024" ;
		:contact = "bernd AT iaa.es" ;
		:dataset_category = "solar" ;
		:grid_label = "gn" ;
		:mip_era = "CMIP6Plus" ;
		:target_mip = "CMIP" ;
		:variable_id = "multiple" ;
		:license = "Solar forcing data produced by SOLARIS-HEPPA is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). The data producers and data providers make no warranty, either expressed or implied, including but not limited to, warranties of merchantability and fitness for a particular purpose. All liabilities arising from the supply of the information (including any liability arising in negligence) are excluded to the fullest extent permitted by law." ;
		:Conventions = "CF-1.8" ;
		:creation_date = "2024-10-14T09:26:01Z" ;
		:source_version = "4.4" ;
		:nominal_resolution = "10000 km" ;
		:product = "derived" ;
		:region = "global" ;
		:tracking_id = "hdl:21.14100/f420da79-7a74-49b3-9693-6412888d1499" ;
}
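
As a rough cross-check, the attributes listed as required earlier in this thread, plus license_id, can be compared against an `ncdump -h` header like the one above. The sketch parses only simple string attributes of the form `:name = "value" ;`. The function names and the exact list of required names are assumptions for illustration, not the validator's rules.

```python
import re

# Matches global string attributes as printed by `ncdump -h`,
# e.g. `:frequency = "mon" ;`
ATTR_RE = re.compile(r'^\s*:(\w+)\s*=\s*"(.*)"\s*;\s*$')

# Assumption: names taken from the list earlier in this thread,
# plus license_id.
REQUIRED = (
    "grid_label", "nominal_resolution", "source_id", "target_mip",
    "creation_date", "tracking_id", "dataset_category", "realm",
    "license_id",
)


def parse_global_attrs(header_text):
    """Collect string-valued global attributes from ncdump header text."""
    attrs = {}
    for line in header_text.splitlines():
        match = ATTR_RE.match(line)
        if match:
            attrs[match.group(1)] = match.group(2)
    return attrs


def missing_attrs(header_text, required=REQUIRED):
    """Return the required attribute names absent from the header."""
    attrs = parse_global_attrs(header_text)
    return [name for name in required if name not in attrs]
```

Applied to the 4.4 header above, this should report license_id as the only missing name from that list.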

@st-bender

st-bender commented Nov 22, 2024

@st-bender I just took a quick peek at your latest monthly file (below),
looks pretty good to me. I note that we had discussed whether "fx" was the
right frequency for the piControl climatology data..
And I don't have a better suggestion immediately to add.

I have exchanged the files behind the links provided earlier,
they should now contain version 4.5.

Out of curiosity, what is the fix/issue that 4.5 will be solving over 4.4?
We'll need that description to be caught alongside the 4.4 dataset
deprecation and replacement by 4.5.

Our website updates will soon be underway, stating the reason:

  • issues with MEE short-term variability have been resolved, no changes to SSI/TSI

@znichollscr
Collaborator

I have exchanged the files behind the links provided earlier, they should now contain version 4.5

Perfect thank you. @durack1 these look good to be put in the publishing queue to me. Links are:

Our website updates will soon be underway, stating the reason

Thank you

@znichollscr znichollscr mentioned this issue Nov 25, 2024
2 tasks
@durack1
Contributor Author

durack1 commented Dec 2, 2024

No problem folks, these 3 new files are now live - see
https://aims2.llnl.gov/search?project=input4MIPs&activeFacets={"mip_era":"CMIP6Plus","institution_id":"SOLARIS-HEPPA"}
