Collapsing cubes with weights kwarg very inefficient (full realization of data, twice) #47

valeriupredoi · 2019-02-26T15:24:04Z

two used cases from running a recipe and profiling it via debug and resource file:

First instance example
-----------------------
2019-02-26 14:03:40.811397      7250.1  7576.5  108     2.7     0       0.606   0.006
2019-02-26 14:03:41.897243      7251.2  7577.7  108     3.8     0       0.606   0.006
2019-02-26 14:03:42.984943      7252.3  7578.9  108     5.8     0       0.606   0.006
2019-02-26 14:03:44.071390      7253.4  7580.0  108     6.4     0       0.606   0.006
2019-02-26 14:03:45.156149      7254.5  7581.2  109     7.9     0       0.606   0.006
2019-02-26 14:03:46.243115      7255.5  7582.4  108     9.4     0       0.606   0.006
2019-02-26 14:03:47.341183      7256.6  7583.6  109     11.1    1       0.606   0.006
2019-02-26 14:03:48.465922      7257.8  7584.8  111     11.7    1       0.606   0.006
2019-02-26 14:03:49.567116      7258.9  7586.0  110     13.2    1       0.606   0.006
2019-02-26 14:03:50.656928      7260.0  7587.1  96      0.6     0       0.606   0.007

2019-02-26 14:03:41,277 UTC [16489] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step average_region
2019-02-26 14:03:41,280 UTC [16489] DEBUG   esmvaltool.preprocessor:187 Running area_average(precipitation_flux / (kg m-2 s-1)   (time: 1740; latitude: 360; longitude: 720)
2019-02-26 14:03:41,284 UTC [16489] INFO    esmvaltool.preprocessor._area_pp:179 Calculated grid area:(1740, 360, 720)
2019-02-26 14:03:50,494 UTC [16489] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step cmor_check_data


Second instance example
-----------------------
2019-02-26 14:36:54.558816      9243.9  9778.3  108     12.4    1       0.708   0.007
2019-02-26 14:36:55.667920      9245.0  9779.5  111     13.5    1       0.708   0.007
2019-02-26 14:36:58.053553      9247.4  9781.9  100     0.5     0       0.708   0.007

2019-02-26 14:17:33,671 UTC [16489] INFO    esmvaltool.preprocessor._area_pp:179 Calculated grid area:(1740, 360, 720)
2019-02-26 14:36:57,901 UTC [16489] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step cmor_check_data

I am assigning this to me since if I can't optimize it myself then I can talk to Bill next week.

The text was updated successfully, but these errors were encountered:

valeriupredoi · 2019-02-26T16:47:25Z

OK - I found something very useful: here is a little function:

def simple_area(cube, coord1, coord2):
    grid_areas = iris.analysis.cartography.area_weights(cube)
    print('LAZY', cube.has_lazy_data())
    result = cube.collapsed([coord1, coord2],
                            iris.analysis.MEAN)
                            weights=grid_areas)
    print('LAZY', result.has_lazy_data())
    return result

if you run this on a 500MB file with weights in it will chuck in a max mem=1.3GB, if you don't apply the weights it will be blindingly fast and max mem=20MB. Note that with weights in the result cube will NOT have lazy data - @bjlittle why is it that using weights as keyword arg for collapsed data gets realized and as such, performance drops like a rock?

valeriupredoi · 2019-02-26T16:54:37Z

argh! I see now - in cube.py when passing weights as kwarg the aggregation is not lazy anymore 😢

valeriupredoi · 2019-02-26T17:00:35Z

OK - good resources:
SciTools/iris#3280
SciTools/iris#2418

@ledm how important are those weights in the area average calculations? ie is it something that alters the results by a significant margin that justifies their use?

valeriupredoi · 2019-02-26T17:09:47Z

ah nevermind! Lack of Area weights may introduce biases of avg 10degs for temperature (just tested with/without weights)

BenMGeo · 2019-03-11T08:23:35Z

I'm currently trying to resolve this in c3s_511:

A code snippet to work around weighting memory while increasing cpu time is:

def __apply_fun2cube__(self, cube, dims=None, function=None,
                           incl_weights = True, **kwargs):
        """
        applies function to a sliced cube (memory saving)
        """
        self.__logger__.info("====================================================")
        self.__logger__.info("Running apply_fun2_cube")
#        self.__logger__.info(cube)
        self.__logger__.info(dims)
        self.__logger__.info(kwargs)
        self.__logger__.info("still lazy?")
        self.__logger__.info(cube.has_lazy_data())
        
        try:
            if "time" not in dims:
    #        if "latitude" in dims:
                latlon_list = []
                
                for latlon in cube.slices(["latitude","longitude"]):
                    
    #                self.__logger__.info(latlon)
                    
                    latlon.remove_coord("day_of_month")
                    latlon.remove_coord("day_of_year")
                    latlon.remove_coord("month_number")
                    latlon.remove_coord("year")
                    
                    if incl_weights and "latitude" in dims:
                        latlon = latlon * self.map_area
                        latlon.standard_name = cube.standard_name
                        latlon.long_name = cube.long_name
                        
    #                self.__logger__.info(latlon)
    
                    latlon = latlon.collapsed(dims,
                                              function,
                                              **kwargs)
                    latlon_list.append(latlon)
            
                cube_list = iris.cube.CubeList(latlon_list)
    #            self.__logger__.info(cube_list)
    #            self.__logger__.info([c.coords for c in cube_list])
                
                new_cube = cube_list.merge_cube()
                
                self.__logger__.info("going path one")
        
#        except: 
            else:
                new_cube = cube.collapsed(dims, function, **kwargs)
                self.__logger__.info("going path two")
        except:
            new_cube = cube.collapsed(dims, function, **kwargs)
            self.__logger__.info("going path three")
            
#        self.__logger__.info(new_cube)
            
#        self.__logger__.info(function)
        self.__logger__.info("still lazy?")
        self.__logger__.info(cube.has_lazy_data())
        self.__logger__.info("====================================================")
        
        return new_cube

You can call the function like:

new_cube = self.__apply_fun2cube__(cube, dims=["latitude"], function=iris.analysis.MEAN)

You see, I'm still debugging and trying to catch everything that kills the function. It works in this way for 1 coord (supposed to work with all usual dimensions in a cube in ESMValTool) staying lazy. You can call it twice, if latitude goes first, in my tests (this is why I need the except... does not work on scalar coordinates!).

BenMGeo · 2019-03-11T08:25:47Z

Plus: It does not work for std_dev, but I could not find out why, yet.

valeriupredoi · 2019-03-11T13:51:05Z

hi @BenMGeo good stuff, man! There is an open SciTools/iris gitHub issue SciTools/iris#3129 that @bjlittle and me we talked about last week, you may want to comment on that one so maybe you and the iris guys can work together on it - we will want a solution straight into iris rather than something in esmvaltool 🍺

BenMGeo · 2019-03-13T11:45:15Z

It works now for STD_DEV, but of course, the weighted standard deviation is not the standard deviation of the weighted values (doh).

I can provide you with a clean function of the above for having a weighted mean, though.

I do not see any other solution to this than writing actual iris aggregators with - based on the iris version we use - xarray or daskarray supported lazy functions yourself.

I'll have to deep dive into the writing of custom aggregators and the packages xarray and dask to provide these aggregators to our codes. Whenever I got there, I'll have them proposed to iris.

valeriupredoi · 2019-03-13T12:01:17Z

great stuff! 🍺 I suggest getting in touch with the iris guys via the associated SciTools issue and present them with the solution, mention @bjlittle and myslef please so we can keep track of the implementation, it's best to have it straight in iris than in esmvaltool - but yeah, cool stuff so far!

valeriupredoi · 2019-03-28T16:55:06Z

@bouweandela good stuff - but have you tested the implementation of SciTools/iris#3299 within ESMValTool?

valeriupredoi · 2020-02-20T15:20:05Z

we can now pass lazy weights to collapse operations eg:

import dask.array as da
def simple_area(cube, coord1, coord2):
    grid_areas = iris.analysis.cartography.area_weights(cube)
    grid_areas = da.array(grid_areas)
    result = cube.collapsed([coord1, coord2],
                            iris.analysis.MEAN,
                            weights=grid_areas)
    return result

and the input data and the output data will be LAZY (iris2.4) but the execution time of the collapse operation with or without weights still differ by a factor of 50 and memory increases by 2-3x
@bjlittle -> any news on the iris front? 🍺

bouweandela · 2023-05-27T20:54:10Z

@valeriupredoi Is this still an issue?

schlunma · 2023-07-14T13:40:51Z

I have extensively tested this in SciTools/iris#5341, so this should not be an issue anymore. Please re-open if necessary.

valeriupredoi self-assigned this Feb 26, 2019

valeriupredoi changed the title ~~Area operations are very inefficient~~ Collapsing cubes with weights kwarg very inefficient (full realization of data, twice) Feb 26, 2019

mattiarighi transferred this issue from ESMValGroup/ESMValTool Jun 11, 2019

mattiarighi added the preprocessor Related to the preprocessor label Jun 11, 2019

bouweandela mentioned this issue Jun 12, 2020

Make preprocessor lazy #674

Open

62 tasks

bouweandela mentioned this issue May 30, 2023

Add support for configuring Dask distributed #2049

Merged

10 tasks

schlunma closed this as completed Jul 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Collapsing cubes with weights kwarg very inefficient (full realization of data, twice) #47

Collapsing cubes with weights kwarg very inefficient (full realization of data, twice) #47

valeriupredoi commented Feb 26, 2019

valeriupredoi commented Feb 26, 2019

valeriupredoi commented Feb 26, 2019

valeriupredoi commented Feb 26, 2019

valeriupredoi commented Feb 26, 2019

BenMGeo commented Mar 11, 2019 •

edited

Loading

BenMGeo commented Mar 11, 2019

valeriupredoi commented Mar 11, 2019

BenMGeo commented Mar 13, 2019 •

edited

Loading

valeriupredoi commented Mar 13, 2019

valeriupredoi commented Mar 28, 2019

valeriupredoi commented Feb 20, 2020 •

edited

Loading

bouweandela commented May 27, 2023

schlunma commented Jul 14, 2023

Collapsing cubes with weights kwarg very inefficient (full realization of data, twice) #47

Collapsing cubes with weights kwarg very inefficient (full realization of data, twice) #47

Comments

valeriupredoi commented Feb 26, 2019

valeriupredoi commented Feb 26, 2019

valeriupredoi commented Feb 26, 2019

valeriupredoi commented Feb 26, 2019

valeriupredoi commented Feb 26, 2019

BenMGeo commented Mar 11, 2019 • edited Loading

BenMGeo commented Mar 11, 2019

valeriupredoi commented Mar 11, 2019

BenMGeo commented Mar 13, 2019 • edited Loading

valeriupredoi commented Mar 13, 2019

valeriupredoi commented Mar 28, 2019

valeriupredoi commented Feb 20, 2020 • edited Loading

bouweandela commented May 27, 2023

schlunma commented Jul 14, 2023

BenMGeo commented Mar 11, 2019 •

edited

Loading

BenMGeo commented Mar 13, 2019 •

edited

Loading

valeriupredoi commented Feb 20, 2020 •

edited

Loading