
Add support for configuring Dask distributed #2049

Merged
15 commits merged into main from configure-dask-distributed on Jun 1, 2023

Conversation


@bouweandela bouweandela commented May 19, 2023

Description

Add support for configuring the Dask scheduler/cluster that is used to run the tool. See the linked documentation for information on how to use the feature.

Use this together with ESMValGroup/ESMValTool#3151 to let the Python diagnostics also make use of the Dask cluster.

Closes #2040

Link to documentation: https://esmvaltool--2049.org.readthedocs.build/projects/ESMValCore/en/2049/quickstart/configure.html#dask-distributed-configuration
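
For readers skimming this thread: the new configuration file (dask.yml) has a cluster section (the class to start, given by type, plus its keyword arguments) and an optional client section (keyword arguments passed to distributed.Client), as the examples further down in this conversation show. A minimal illustrative sketch; all values here are assumptions, not recommendations:

cluster:
  type: distributed.LocalCluster   # or e.g. dask_jobqueue.SLURMCluster
  n_workers: 8                     # number of Dask workers to start
  threads_per_worker: 2
  memory_limit: 32 GiB             # memory per worker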


Before you get started

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.


To help with the number of pull requests:

@bouweandela bouweandela added the enhancement New feature or request label May 19, 2023
@bouweandela bouweandela added this to the v2.9.0 milestone May 22, 2023

codecov bot commented May 22, 2023

Codecov Report

Merging #2049 (2b428a3) into main (3d6ed66) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #2049      +/-   ##
==========================================
+ Coverage   92.87%   92.90%   +0.03%     
==========================================
  Files         234      235       +1     
  Lines       12447    12506      +59     
==========================================
+ Hits        11560    11619      +59     
  Misses        887      887              
Impacted Files Coverage Δ
esmvalcore/_main.py 90.87% <100.00%> (+0.07%) ⬆️
esmvalcore/_task.py 72.42% <100.00%> (+0.80%) ⬆️
esmvalcore/config/_dask.py 100.00% <100.00%> (ø)
esmvalcore/experimental/recipe.py 90.16% <100.00%> (+0.16%) ⬆️

@bouweandela bouweandela marked this pull request as ready for review May 23, 2023 08:34
@bouweandela bouweandela requested a review from fnattino May 23, 2023 08:34
@valeriupredoi valeriupredoi left a comment


small bits and bobs before I go into the Python scripts, bud 🍺

'`scheduler <https://docs.dask.org/en/stable/scheduling.html>`_'.
The default scheduler in Dask is rather basic, so it can only run on a single
computer and it may not always find the optimal task scheduling solution,
resulting in excessive memory use when using e.g. the
Contributor

Suggested change
resulting in excessive memory use when using e.g. the
resulting in excessive memory use when running an already memory-intensive task like e.g. the

Member Author

The problem isn't so much that it's memory-intensive, but that the task graph becomes too complicated for the built-in scheduler.

Contributor

yes - for regular Joe the Modeller: moar memory! Let's scare them before they even think of touching anything 😁

bouweandela and others added 3 commits May 23, 2023 15:25
Co-authored-by: Valeriu Predoi <valeriu.predoi@gmail.com>
Co-authored-by: Valeriu Predoi <valeriu.predoi@gmail.com>

@valeriupredoi valeriupredoi left a comment


couple more comments, bud. test_dask.py to follow, then we done here 😁

if CONFIG_FILE.exists():
    config = yaml.safe_load(CONFIG_FILE.read_text(encoding='utf-8'))
    if config is not None:
        dask_args = config
Contributor

a warning would be nice, telling the user to have the config available and configured if they want to use dasky stuff

Member Author

Added in 25dc5ce


@valeriupredoi valeriupredoi left a comment


very nice work @bouweandela 🎖️ Couple comments I left earlier that have not been resolved, up to you to address, no biggie. It looks very nice and hopefully it'll work well in practice too, i.e. iris will play well

@valeriupredoi valeriupredoi added the dask related to improvements using Dask label May 23, 2023
@valeriupredoi

btw I've just created a label called "Dask" - let's use that for Dask-related improvements, and maybe even collect Dask items in the changelog, at least for this upcoming release, that'll be Dask-heavy. Wanted to call it Bouwe instead of Dask 😁


@remi-kazeroni remi-kazeroni left a comment


Thanks a lot @bouweandela! Just taking a look at the well written Docs to get a better idea of these new developments.

Co-authored-by: Rémi Kazeroni <remi.kazeroni@dlr.de>
@schlunma schlunma self-requested a review May 24, 2023 13:45
@schlunma

I'd like to test this with a couple of recipes, please do not merge just yet 😁

@valeriupredoi

I'd like to test this with a couple of recipes, please do not merge just yet 😁

good call, Manu! 🍺


@schlunma schlunma left a comment


Hi @bouweandela, I just tested this thoroughly - this is really amazing stuff and works super well! 🚀 For the first time ever I was able to run ESMValTool on multiple nodes! In addition, monitoring an ESMValTool run with a dask dashboard is also super convenient. See here a nice visualization of our example recipe:

[Screenshot: Dask dashboard visualization of the example recipe run (esmvaltool_dask)]

I have some general comments/questions:

  • How does this interact with our current multiprocessing-based parallelization? I tested this with different --max-parallel-tasks options and all seemed to work well, but I do not quite understand how. Does every process use its own cluster? Or do they share it? Is this safe?
  • As far as I understand, the only way to use multiple nodes in a SLURMCluster is to use the cluster.scale(jobs=n_jobs) method, see here. Would it make sense to add a scale option to the configuration file where one could add keyword arguments for scale?
  • When running this, I always get UserWarning: Sending large graph of size 56.83 MiB. This may cause some slowdown. Consider scattering data ahead of time and using futures., even for the example recipe. Not sure if this is/will be a problem. Not something we need to tackle here though; just wanted to document it.

Thanks!! 👍

Comment on lines +264 to +267
Create a Dask distributed cluster on the
`Levante <https://docs.dkrz.de/doc/levante/running-jobs/index.html>`_
supercomputer using the `Dask-Jobqueue <https://jobqueue.dask.org/en/latest/>`_
package:
Contributor

It would be nice to mention that this needs to be installed by the user (e.g., mamba install dask-jobqueue) because it's not part of our environment.

Member Author

I like @valeriupredoi's suggestion of just adding it to the dependencies. It doesn't have any dependencies that we do not already have and it's a very small Python package.

Contributor

Sounds good, that's even better!

Member Author

Done in 25dc5ce

@valeriupredoi

That's wonderful to see, Manu, awesome you tested, and of course, awesome work by Bouwe! I'll let Bouwe answer your q's (although I know the answer for the first q from code review hehe), but about the jobqueue dep - we should probably just add it to our env, I'd reckon anything Dask-related should need as little tweaking by the user as possible. About data scattering - I've done it before and have seen marginal performance improvement, but I used fairly small data sizes, so it could be better for busloads of data, as we have 😁

@bouweandela

Thanks for testing @schlunma!

How does this interact with our current multiprocessing-based parallelization? I tested this with different --max-parallel-tasks options and all seemed to work well, but I do not quite understand how. Does every process use its own cluster? Or do they share it? Is this safe?

When no cluster is configured, every process will use its own. This is what is causing memory errors at the moment. With the pull request, this is fixed (provided that a user sets up a cluster) by passing the scheduler address to the process running the task and making it connect to the cluster. You can see that here in the code:

future = pool.apply_async(_run_task,
                          [task, scheduler_address])

def _run_task(task, scheduler_address):
    """Run task and return the result."""
    if scheduler_address is None:
        # No distributed cluster configured: fall back to Dask's default scheduler.
        client = contextlib.nullcontext()
    else:
        # Connect the process running this task to the configured cluster.
        client = Client(scheduler_address)
    with client:
        output_files = task.run()
    return output_files, task.products

The multiprocessing configuration is used to parallelize the metadata crunching (i.e. all the stuff with coordinates etc). In a future pull request, I intend to make it possible to also run that on a Dask cluster, see #2041.

As far as I understand the only way to use multiple nodes in a SLURMCluster is to use the cluster.scale(jobs=n_jobs) method, see here. Would it make sense to add a scale option to the configuration file where one could add keyword arguments for scale?

This can also be configured with the parameter n_workers:
https://jobqueue.dask.org/en/latest/generated/dask_jobqueue.SLURMCluster.html#dask-jobqueue-slurmcluster

When running this, I always get UserWarning: Sending large graph of size 56.83 MiB. This may cause some slowdown. Consider scattering data ahead of time and using futures., even for the example recipe. Not sure if this is/will be a problem. Not something we need to tackle here though; just wanted to document it.

I suspect this warning occurs because some of the preprocessor functions are not lazy, e.g. multi_model_statistics.

@schlunma

When no cluster is configured, every process will use its own. This is what is causing memory errors at the moment.

Could you elaborate on that? What do you mean by "at the moment"? The current main? Do you have an example for such a memory error?

With the pull request, this is fixed (provided that a user sets up a cluster), by passing the scheduler address to the process running the task and making it connect to the cluster. The multiprocessing configuration is used to parallelize the metadata crunching (i.e. all the stuff with coordinates etc).

True, could have guessed that from the code 😁 I suppose the cluster is smart enough to handle input from different processes? Are there any recommendations for the ratio max_parallel_tasks/n_workers? E.g., does it make sense to use 16 parallel tasks with just 1 worker?

In a future pull request, I intend to make it possible to also run that on a Dask cluster, see #2041.

Cool!

This can also be configured with the parameter n_workers: https://jobqueue.dask.org/en/latest/generated/dask_jobqueue.SLURMCluster.html#dask-jobqueue-slurmcluster

I tried that with n_workers=4, but just got 1 node. I now used n_workers=16 and got 4 nodes. So I guess the number of nodes is a combination of the different parameters.

@schlunma Did you also have a chance to test this with some recipes using a large amount of data? I would be interested in your results.

Not yet, but I can try to run a heavy recipe with this 👍

@schlunma

Alright, I did a lot of tests now. Note: Varying max_parallel_tasks did not significantly change the outcome in any test below.

Test with existing heavy recipe using distributed.LocalCluster on a 512 GiB compute node

Bad news first: I couldn't get one of the heaviest recipes I know (recipe_schlund20esd.yml) to run. I tried many different options (see table below), but always got a concurrent.futures._base.CancelledError (see log below).

Test with simple heavy recipe that uses a purely lazy preprocessor using distributed.LocalCluster on a 512 GiB compute node

I then created a very simple heavy recipe that uses some high-res CMIP6 data:

preprocessors:
  test:
    regrid:
      target_grid: 2x2
      scheme: linear

diagnostics:

  test:
    variables:
      ta:
        preprocessor: test
        mip: Amon
        project: CMIP6
        exp: hist-1950
        start_year: 1995
        end_year: 2014
    additional_datasets:
      - {dataset: CMCC-CM2-VHR4,   ensemble: r1i1p1f1, grid: gn}
      - {dataset: CNRM-CM6-1-HR,   ensemble: r1i1p1f2, grid: gr}
      - {dataset: ECMWF-IFS-HR,    ensemble: r1i1p1f1, grid: gr}
      - {dataset: HadGEM3-GC31-HM, ensemble: r1i1p1f1, grid: gn}
      - {dataset: MPI-ESM1-2-XR,   ensemble: r1i1p1f1, grid: gn}
    scripts:
      null

The good news: With a purely lazy preprocessor (regrid), it worked pretty well and I also got a significant speed boost.

| n_workers | memory_limit | run time | memory usage | comment |
| --- | --- | --- | --- | --- |
| 4 | 128 GiB | 3:59 min | 3.6 GB | |
| 16 | 32 GiB | 2:07 min | 17.9 GB | default settings when using distributed.LocalCluster() without any arguments |
| 8 | 32 GiB | 1:56 min | 8.7 GB | |
| - | - | 9:31 min | 15.7 GB | no cluster (i.e., current main) |

As you can see, I could easily get a speed boost of ~5x!

Test with simple heavy recipe that uses a non-lazy preprocessor using distributed.LocalCluster on a 512 GiB compute node

preprocessors:
  test:
    area_statistics:
      operator: mean

diagnostics:

  test:
    variables:
      ta:
        preprocessor: test
        mip: Amon
        project: CMIP6
        exp: hist-1950
        start_year: 1995
        end_year: 2014
    additional_datasets:
      - {dataset: CMCC-CM2-VHR4,   ensemble: r1i1p1f1, grid: gn}
      - {dataset: CNRM-CM6-1-HR,   ensemble: r1i1p1f2, grid: gr}
      - {dataset: ECMWF-IFS-HR,    ensemble: r1i1p1f1, grid: gr}
      - {dataset: HadGEM3-GC31-HM, ensemble: r1i1p1f1, grid: gn}
      - {dataset: MPI-ESM1-2-XR,   ensemble: r1i1p1f1, grid: gn}
    scripts:
      null

However, when using a non-lazy preprocessor (area_statistics), I always got the same concurrent.futures._base.CancelledError as for the other heavy recipe, regardless of the settings (main_log_debug.txt).

I also tried combinations where the cluster only had a small fraction of the memory (e.g., n_workers: 4, memory_limit: 64 GiB, i.e., 256 GiB remain!), which also did not work.

I'm not familiar enough with dask to tell why this is not working, but it looks like this is related to using a non-lazy function. I thought one could maybe address this by allocating enough memory for the "non-dask" part, but that also didn't work.

Tests with dask_jobqueue.SLURMCluster

I got basically the same behavior as above. With the following settings the lazy recipe ran in 8:27 min (memory usage of 0.8 GB; not quite sure if that is correct or wrong because the heavy stuff is now done on a different node):

cluster:
  type: dask_jobqueue.SLURMCluster
  ...
  queue: compute
  cores: 128
  memory: 32 GiB
  n_workers: 8

There's probably a lot to optimize here, but tbh I don't fully understand all the options for SLURMCluster yet 😁
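
For reference, an annotated version of that configuration, with the meaning of the main dask_jobqueue.SLURMCluster options as described in the dask-jobqueue documentation (values copied from above, not a recommendation):

cluster:
  type: dask_jobqueue.SLURMCluster
  queue: compute      # SLURM partition to submit the worker jobs to
  cores: 128          # CPU cores requested per SLURM job
  memory: 32 GiB      # memory requested per SLURM job
  n_workers: 8        # total number of Dask workers to start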

To sum this all up: for lazy preprocessors this works exceptionally well! 🚀

@bouweandela

Could you elaborate on that? What do you mean by "at the moment"? The current main? Do you have an example for such a memory error?

Yes, in the current main branch. Maybe the recipe from #1193 could be an example?

@valeriupredoi

hello, cool stuff! Very cool indeed @schlunma 💃 Here are some prelim opinions:

  • lazy preprocs are the only ones that work as expected because this whole machinery is optimized for lazy stuff
  • settings depend on the amount of data and computation needed - and resources used really do vary based on the cluster settings - afraid there is no best set of settings parameters (eg try half the data for your lazy regrid runs and you'll see that performance changes and other settings return the best balance)
  • the concurrent Futures issue seems to be coming from very small scattered data packets that get created, then deleted, then recreated, see e.g. Getting concurrent.futures._base.CancelledError from simple binary tree built from futures dask/distributed#4612 - it'd be good to first scatter the data (since it's not lazy) and then do the computations; once scattered, the data is smaller, and the default scattering by the cluster may not happen anymore (that is bad, since the cluster doesn't really know how to do it properly)

Very cool, still 🍺

@bouweandela

I suppose the cluster is smart enough to handle input from different processes? Are there any recommendations for the ratio max_parallel_tasks/n_workers?

Provided that all preprocessor functions are lazy, I would recommend setting max_parallel_tasks equal to the number of CPU cores in the machine, since coordinates are mostly realized, meaning this will use on average 1 CPU core per task. I suspect that in most cases it does not make sense to have n_workers larger than max_parallel_tasks until we have delayed saving mentioned in #2042, since a task can only save one file at a time and I would expect a worker to be able to process all data required for a single output file. Depending on what you're actually trying to compute of course.
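
As an illustration of this rule of thumb, on a hypothetical 16-core machine the two settings could be paired roughly like this (illustrative values, not tested recommendations):

# config-user.yml
max_parallel_tasks: 16

# dask.yml
cluster:
  type: distributed.LocalCluster
  n_workers: 16        # keep at or below max_parallel_tasks until delayed saving (#2042) lands
  threads_per_worker: 2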

@valeriupredoi

and I would expect a worker to be able to process all data required for a single output file

that's the ideal case, right? Minimum data transfer between workers - that's why Manu's jobs are failing now for non-lazy pp's - too much ephemeral data exchange between workers, I think

@valeriupredoi

this post says the Cancelled error comes from some node/worker being idle for too long - interesting - so it's either too short a computation that deletes the key on the worker, but the Client still needs it/recreates it, or the thing's too long - it'd be worth examining the "timeout" options, see https://docs.dask.org/en/stable/configuration.html - beats me which one though, it used to be a scheduler_timeout

@schlunma

I have now also tested the (fully lazy) recipe from #1193 (which used to not run back then):

# ESMValTool
---
documentation:
 description: Test ERA5.

 authors:
   - schlund_manuel

 references:
   - acknow_project


preprocessors:

 mean:
   zonal_statistics:
     operator: mean


diagnostics:

 ta_obs:
   variables:
     ta:
       preprocessor: mean
       mip: Amon
       additional_datasets:
         - {dataset: ERA5, project: native6, type: reanaly, version: v1, tier: 3, start_year: 1990, end_year: 2014}
   scripts:
     null

The good news is: it now runs without any dask configuration, and I could also get a significant speed-up (5:41 min -> 1:56 min) by using a LocalCluster.

Replacing the lazy zonal_statistics with the non-lazy area_statistics led to the same error as already reported.

Comment on lines 18 to 24
logger.warning(
"Using the Dask basic scheduler. This may lead to slow "
"computations and out-of-memory errors. See https://docs."
"esmvaltool.org/projects/ESMValCore/en/latest/quickstart/"
"configure.html#dask-distributed-configuration for information "
"on how to configure the Dask distributed scheduler."
)
Contributor

I am not too happy with this being a warning at the current stage: as shown by my tests, using a distributed scheduler can actually lead to recipes not running anymore. Thus, I would be very careful recommending this to users at the moment.

We should either phrase this more carefully (maybe add that this is an "experimental feature") or convert it to an info message. Once we are more confident with this we can change it back.

Contributor

Also: this does not raise a warning if the dask config file exists but is empty. Should we also consider this case?

Member Author

I can change it back to INFO if @valeriupredoi agrees, because he asked for a WARNING in #2049 (comment). Note that not configuring the scheduler can also lead to recipes not running, so the best setting rather depends on what you're trying to do, as also noted in the documentation.

@valeriupredoi valeriupredoi May 26, 2023

yeah that was me before Manu's actual testing. But alas, Bouwe is right too - I'd keep it as a warning but add what Manu suggests - experimental feature with twitchy settings that depend on the actual run

Contributor

Also: this does not raise a warning if the dask config file exists but is empty. Should we also consider this case?

Won't that default to a LocalCluster etc?

Contributor

I'd keep it as a warning but add what Manu suggests - experimental feature with twitchy settings that depend on the actual run

Fine for me!

Won't that default to a LocalCluster etc?

No, this also results in the basic scheduler: https://github.com/ESMValGroup/ESMValCore/pull/2049/files#diff-b046a48e3366bf6517887e3c39fe7ba6f46c0833ac02fbbb9062fb23654b06bdR64-R69

@valeriupredoi valeriupredoi May 26, 2023

aw. Not good. That needs to be communicated to the user methinks - ah, it is, I didn't read the very first sentence 😆 Fine for me then

Member Author

Done in 80f5c62


@schlunma schlunma left a comment


Thanks for all the changes @bouweandela! I love the changes to the documentation, with this I could get a recipe to run in 30s which previously took 6 mins!

Just out of curiosity: where did you get that n_jobs = n_workers / processes from? I searched for that yesterday but couldn't find anything.

Please make sure to adapt the warning as discussed in the corresponding comment. Will approve now since I will be traveling next week.

Thanks again, this is really cool! 🚀

@valeriupredoi

where did you get that n_jobs = n_workers / processes from?

I think that's just Bouwe being sensible, as he usually is 😁 i.e. you'd ideally want one process per worker

@valeriupredoi

awesome testing @schlunma - cheers for that! I'll dig deeper into those errors you've been seeing once I start testing myself, sometime next week 👍


@sloosvel sloosvel left a comment


Thanks @bouweandela ! I managed to create some SLURM clusters just fine with these changes. Regarding the configuration, is it possible to have multiple configurations in dask.yml? Not every recipe will require the same type of resources.

However, I am running on a machine that has quite modest technical specifications, to put it nicely, and even running in a distributed way does not seem to help much with performance and memory usage.

Some issues that were found for instance:

  • The concatenation of multiple files seems to slow down
  • The heaviest jobs all seem to die during the saving of the files.
  • Even for jobs that finish successfully, I get errors in the middle of the debug log about workers dying, not being responsive and communications dying. Not sure where these are coming from.

It could very well be that I am not configuring the jobs properly though.

It is true though that everything that is already lazy performs even faster. The lazy regridder went from taking 4 minutes to run in a high-resolution experiment to taking less than 2 minutes.


@fnattino fnattino left a comment


Very nice @bouweandela, looks great to me! Nice that you have both the cluster and the client config sections, so that the following also works as a 'shortcut' to create a local cluster:

client:
  n_workers: 2
  threads_per_worker: 4


bouweandela commented May 30, 2023

Just out of curiosity: where did you get that n_jobs = n_workers / processes from? I searched for that yesterday but couldn't find anything.

@schlunma From the documentation of the processes argument:

Cut the job up into this many processes.

i.e. processes is the number of workers launched by a single job.
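
To make that relation concrete, a hypothetical example (illustrative values only): each SLURM job starts as many workers as given by processes, so dask-jobqueue submits roughly n_workers / processes jobs to reach the requested number of workers.

cluster:
  type: dask_jobqueue.SLURMCluster
  cores: 128        # cores per SLURM job, shared by the workers in that job
  processes: 4      # Dask workers started by each SLURM job
  n_workers: 16     # total workers requested -> about 16 / 4 = 4 SLURM jobs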


bouweandela commented May 30, 2023

Regarding the configuration, is it possible to have multiple configurations in dask.yml? Not every recipe will require the same type of resources.

Good idea @sloosvel. @ESMValGroup/technical-lead-development-team Shall I add a warning that the current configuration file format is experimental and may change in the next release? We can then discuss how to best add this to the configuration at the SMHI workshop ESMValGroup/Community#98.

@bouweandela

@schlunma and @sloosvel Thank you for running the tests! Really good to learn from your experiences. Regarding the dying workers: I suspect they are using more than the configured amount of memory and are then killed, so giving them more memory may solve the problem, though if the preprocessor functions you're using are not lazy, that may not help.

Regarding the area_statistics preprocessor function: this is actually lazy (except for median), but there seems to be something odd going on. If I run the example recipe provided by @schlunma in #2049 (comment) with just one model and one year of data, I already get the warning that 1.5 GB of data is sent to the workers (i.e. the size of the weights array), even though the weights array is lazy. Maybe #47 is applicable here.

@bouweandela

The concatenation of multiple files seems to slow down

@sloosvel If this is a problem for your work, could you please open an issue and add an example recipe so we can look into it?

@bouweandela

@ESMValGroup/technical-lead-development-team I think this is ready to be merged. Could someone please do a final check and merge?

@remi-kazeroni

Thanks a lot @bouweandela for your work on this great extension! It should definitely be one of the highlights of the next Core release! And thanks to everyone who reviewed this PR and tested the new Dask capabilities! Merging now 👍

@remi-kazeroni remi-kazeroni merged commit 1c1e6f1 into main Jun 1, 2023
@remi-kazeroni remi-kazeroni deleted the configure-dask-distributed branch June 1, 2023 09:09