
Why do we want to CMORize all observational datasets? #1120

Open
bascrezee opened this issue Jun 4, 2019 · 5 comments

Comments

@bascrezee
Contributor

Maybe we should instead define interfaces to existing packages that take care of reading datasets into common Python data structures. For example, intake is particularly well suited to reading a very diverse set of data in different formats. Another interesting project, focused on satellite datasets, is open data cube.
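
As a rough illustration of that idea, here is a minimal sketch using intake (assuming the intake-xarray driver and a hypothetical catalog file `obs_catalog.yaml` with a hypothetical entry `some_obs_dataset`): a recipe could read observations straight from a catalog into a common data structure, with no separate CMORization step.

```python
import intake

# Hypothetical catalog describing where each observational dataset lives
# and which driver reads it (e.g. intake-xarray for NetCDF files).
catalog = intake.open_catalog("obs_catalog.yaml")

# Lazily load one dataset into a common data structure (an xarray.Dataset);
# "some_obs_dataset" is a hypothetical catalog entry name.
ds = catalog["some_obs_dataset"].to_dask()
print(ds)
```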

@zklaus

zklaus commented Jun 4, 2019

@hb326 thoughts? Comments?

@bascrezee bascrezee changed the title Do we really need to CMORize all observational datasets? Why do we want to CMORize all observational datasets? Jun 4, 2019
@bascrezee
Contributor Author

bascrezee commented Jun 4, 2019

@mattiarighi just answered this question for me offline; his answer is:

because we want to have a pool of observational data

@bouweandela
Member

That does not really answer the question, because you can also have a pool of observational data without reformatting it.

I think the real answer is probably a perceived run-time advantage: reformatting takes some time, so if it had to be done every time a recipe runs, recipes could be noticeably slower.
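
A minimal sketch of that trade-off (all names here are hypothetical, not ESMValTool API): reformat on the first run and cache the result, so subsequent recipe runs skip the cost.

```python
from pathlib import Path

def get_cmorized(raw_path: str, cache_dir: str = "cmorized") -> Path:
    """Return a CMORized copy of raw_path, reformatting only on a cache miss."""
    out = Path(cache_dir) / (Path(raw_path).stem + "_cmor.nc")
    if not out.exists():
        out.parent.mkdir(parents=True, exist_ok=True)
        cmorize(raw_path, out)  # hypothetical one-off reformatting step
    return out
```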

@valeriupredoi
Contributor

I'll take a step back and point to a few things:

  • reformatting needs to be done according to a number of standards: the CF and CMOR conventions most importantly, but also ESMValTool-specific conventions that are not forcibly imposed but make life easier (i.e. preferred time units, preferred metadata items, etc.), so it's much better if reformatting is done once and the reformatted data is shoved into a box from where it can be used right out of the box (see the sketch after this list);
  • as @bouweandela points out, reformatting can be done on the fly, but it can be time- and CPU-consuming depending on how much data needs to be converted;
  • one social aspect of this matter is the user's comfort in knowing that there is a database of nicely formatted data where they can just point the tool and all goes smoothly (i.e. the risk of the tool failing is smaller, since it needs to perform fewer actions) - the same comfort the ESGF nodes provide the user: a nice place where nice data lives (nice my arse, given how many problems ESGF data has, but that's a different fish altogether);
  • lastly, the question of LARGE datasets comes to mind - the ones that are so large they can't be stored in one place and need to be reformatted on the fly. Those should probably stay on the fly, but apart from that I reckon if we can store the data, then we should run the reformatting as few times as possible;
  • 🍺
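
As a rough illustration of the first point, here is a minimal sketch of the kind of one-off fix a reformatting step applies (file and variable names are hypothetical, and this uses plain xarray rather than the actual ESMValTool CMORizer machinery): rename a raw variable to its CMOR short name, attach CF metadata, and write the time axis with preferred units.

```python
import xarray as xr

# Hypothetical raw observational file with a variable named "t2m".
ds = xr.open_dataset("raw_obs.nc")

# Rename to the CMOR short name and attach CF-compliant metadata.
ds = ds.rename({"t2m": "tas"})
ds["tas"].attrs["standard_name"] = "air_temperature"
ds["tas"].attrs["units"] = "K"

# Write out once, encoding time with preferred reference units, so every
# later recipe run can read this file directly.
ds.to_netcdf(
    "OBS_dataset_Amon_tas.nc",
    encoding={"time": {"units": "days since 1950-01-01"}},
)
```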

@mattiarighi
Contributor

mattiarighi commented Sep 29, 2020

Speaking of intake, today's tech talk by DKRZ on this subject might be of interest:
https://www.dkrz.de/up/de-news-and-events/de-tech-talks/de-dkrz-tech-talk-intake-taking-the-pain-out-of-data-access

It should be available on their YouTube channel soon.
