Do not realize data and use dask in derivation functions #42

mattiarighi · 2019-03-08T10:52:51Z

As reported by @bouweandela, many derivation functions still realize the data and use numpy instead of dask. This is detrimental to performance and should be changed.

Affected variables:

* `amoc`
* `gtfgco2` (should not be needed anymore; there is a preprocessor function for this)
* `sm`
* `toz`


bouweandela · 2019-03-08T10:56:56Z

introduction to lazy data: https://scitools.org.uk/iris/docs/latest/userguide/real_and_lazy_data.html

If you need array functions, use `from dask import array as da` instead of `import numpy as np`. See here for a description of the dask array options: http://docs.dask.org/en/latest/array-api.html
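A minimal sketch of that swap, assuming only that dask and numpy are installed. The array below is a hypothetical stand-in for the lazy data of a cube, and the values are arbitrary:

```python
import numpy as np
from dask import array as da

# Hypothetical stand-in for lazy cube data; shape and chunks are arbitrary.
lazy = da.ones((4, 3), chunks=(2, 3), dtype=np.float32)

# dask.array mirrors much of the numpy API; each call only extends the
# task graph and does not touch the underlying values.
scaled = da.sqrt(lazy * 4.0)
total = scaled.sum(axis=0)

# The result is still a dask array at this point.
print(type(total).__name__)

# Only an explicit compute() realizes the data.
result = total.compute()
```

Calling the `np.` equivalents on realized data would pull the whole array into memory first, which is what this issue is trying to avoid.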

bouweandela · 2019-03-08T11:01:05Z

See here for an example on multiplying a cube with a number without realizing data:
https://github.com/ESMValGroup/ESMValTool/blob/822941f52780dde2b0b122a9a8a99f23e313ef30/esmvaltool/cmor/_fixes/CMIP5/BNU_ESM.py#L154

And here for an example on using dask arrays:
https://github.com/ESMValGroup/ESMValTool/blob/822941f52780dde2b0b122a9a8a99f23e313ef30/esmvaltool/cmor/_fixes/CMIP5/BNU_ESM.py#L177-L178

valeriupredoi · 2019-03-11T13:41:09Z

just want to say, I ❤️ this thread 😁

mattiarighi · 2019-03-11T13:50:08Z

Great! Then it's yours 👍

valeriupredoi · 2019-03-11T14:03:54Z

yay, more crap for me!
FYI to us all: at the meeting with the iris folk last week I asked if they could explicitly document which iris functions realize the data and which keep it lazy. Corinne has already started working on this (very important) info: SciTools/iris#3292

valeriupredoi · 2019-03-11T16:12:26Z

As reported by @bouweandela, many derivation functions still realize the data and use numpy instead of dask. This is detrimental for the performance and should be changed.

Affected variables:
* [ ] `amoc`

this should be fine, no actual accessing of the data member

* [ ] `gtfgco2`

this one needs to remove the data access and mask construction, and stop building a list of numpy arrays (bad wolf!)

* [ ] `sm`

just the last bit that builds the mask from the data

* [ ] `toz`

total mess, how do you set the dtype of a dask array?
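For the mask and dtype points, a hedged sketch using plain dask follows. It is not the actual `sm` or `toz` derivation code; the fill value and input array are hypothetical. It only shows that a mask can be built with `da.ma.masked_where` and a dtype changed with `astype` without realizing the data:

```python
import numpy as np
from dask import array as da

# Hypothetical float64 input where a fill value marks missing points.
FILL = 1.0e20
raw = da.from_array(np.array([[1.0, FILL], [3.0, 4.0]]), chunks=(1, 2))

# Build the mask lazily instead of inspecting realized data.
masked = da.ma.masked_where(raw >= FILL, raw)

# astype() returns a new lazy array with the requested dtype.
as_float32 = masked.astype(np.float32)

# Realize only here, for inspection; a derivation function would normally
# hand the lazy array back to the cube instead.
realized = as_float32.compute()
```

The dtype is known before anything is computed (`as_float32.dtype` is `float32`), so it can be checked without realizing the data.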

valeriupredoi · 2019-03-14T12:24:37Z

@zklaus would you be too angry with me if I asked you to look at this? I have a metric ton of crap that I need to take care of and I feel I am going to sideline this - plus you are working closely with the iris stuff anyways. Beer from me when we next meet 🍺

ledm · 2019-11-14T16:37:06Z

Similar to this discussion, is there a way to switch off writing the derived variables to disk? It seems to slow everything down and shouldn't be necessary.

bouweandela · 2019-11-15T08:35:31Z

You probably mean the input variables needed to derive a variable? In that case the answer is no.

ledm · 2019-11-15T16:16:21Z

Is it not possible to load the cubes into dask arrays, instead of saving them?

In the case of the derivation of OHC, it loads a 4D variable and saves it exactly as it is. It basically copies 20GB of data into the working directory for each dataset before doing any calculations! All I want is a scalar field, it should be a few kb!

Furthermore, we only have 100GB space in our home directories on jasmin, this means that there's only space for a few models using this method. (I will move my working directory somewhere with more space, but this still doesn't seem like a great method!)
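To illustrate the dask behaviour this question refers to, here is a small sketch of reducing a chunked 4D array to a short series. It says nothing about how the tool handles the input variables of a derivation; the shape, chunking and dimension order are assumptions chosen so the example runs quickly:

```python
from dask import array as da

# Hypothetical 4D field (time, depth, lat, lon), one chunk per time step.
field = da.ones((12, 10, 18, 36), chunks=(1, 10, 18, 36), dtype="float32")

# Reduce over depth, lat and lon; this only builds a task graph.
per_time = field.sum(axis=(1, 2, 3))

# compute() works through the chunks and returns only 12 numbers.
series = per_time.compute()
```

With this chunking, the full 4D array never has to be held in memory at once; only the reduced result is kept.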

bouweandela · 2019-11-28T09:35:42Z

> Is it not possible to load the cubes into dask arrays, instead of saving them?

Maybe in the future, but not at the moment. Do you feel like implementing this yourself?

> Furthermore, we only have 100GB space in our home directories on jasmin, this means that there's only space for a few models using this method. (I will move my working directory somewhere with more space, but this still doesn't seem like a great method!)

The Jasmin user guide (https://help.jasmin.ac.uk/article/176-storage) recommends using a group workspace, not your home directory, for storing large amounts of data. I started on pull request #265, which will make it possible to store preprocessor and other temporary data on a special temporary file system, but this is not ready yet.

ledm · 2020-05-29T09:48:18Z

Just a comment that gtfgco2 may still be needed. I've commented on the merged PR here: #418 (comment), but I'm happy to continue the discussion here if needed.

bouweandela · 2024-06-10T12:42:49Z

Up-to-date overview and discussion in #2451.

mattiarighi changed the title ~~Do not realize data and use dask in derivation function~~ Do not realize data and use dask in derivation functions Mar 8, 2019

mattiarighi assigned valeriupredoi Mar 11, 2019

valeriupredoi assigned zklaus Mar 14, 2019

mattiarighi transferred this issue from ESMValGroup/ESMValTool Jun 11, 2019

mattiarighi added the preprocessor and enhancement labels Jun 11, 2019

This was referenced Jun 12, 2019

Use float32 for toz derivation instead of float64 #81

Merged

Make sm derivation lazy #82

Merged

Lazy derivation of gtfgco2 #83

Closed

mattiarighi mentioned this issue Oct 10, 2019

Added vegfrac as a derived variable #288

Merged

ledm mentioned this issue Nov 20, 2019

Derived variables save all data to disk before preprocessing #377

Closed

bouweandela mentioned this issue Jan 3, 2020

Remove derived variable gtfgco2 #418

Merged

4 tasks

bouweandela mentioned this issue Jun 12, 2020

Make preprocessor lazy #674

Open

62 tasks

bouweandela mentioned this issue Nov 25, 2020

Remove numba dependency #880

Merged

9 tasks

zklaus removed their assignment Feb 7, 2024

bouweandela added this to ESiWACE3 ESMValTool service Feb 12, 2024

bouweandela moved this to In Progress in ESiWACE3 ESMValTool service Feb 12, 2024

bouweandela mentioned this issue Feb 12, 2024

More lazy fixes and preprocessing functions #2325

Merged

8 tasks

bouweandela mentioned this issue Jun 10, 2024

Lazy derive preprocessor function #2451

Open

6 tasks

bouweandela closed this as not planned Jun 10, 2024

github-project-automation bot moved this from In Progress to Done in ESiWACE3 ESMValTool service Jun 10, 2024

github-project-automation bot added this to High priority issues Aug 28, 2024

github-project-automation bot moved this to Done in High priority issues Aug 28, 2024
