CMIP3 processing does not work in parallel mode #430
Could you please attach an example recipe that reproduces the problem? |
Here it goes... |
so here's a dumb question - I did take a look at the CMIP3 code changes and nothing struck me as obvious that may limit the parallel running - can you run any other type of data in parallel? |
Yes, I always run the tool in parallel. |
I also noticed it only hangs on the first dataset in the recipe, if I use the third one instead everything works fine. |
it looks to me like a data loading issue with
I get the same behaviour for MIPIM/ECHAM5 so there is a trend here, buuut with IPSL-CM4 all goes finey-fine |
OK I figured out where the roadblock is - at save point: it hangs right here (in debug mode one can see):
Interestingly enough there is no issue saving the originally loaded cube in multiproc mode, see:
BTW @jvegasbsc you can test on Jasmin since CMIP3 data is in |
Based on the stack trace I think the problem is actually with loading the data. This is only done when saving because all preprocessing steps are lazy. |
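As an aside, here is a minimal sketch (hypothetical file names, not ESMValCore code) of why the data is only read at save time when everything upstream is lazy:

```python
import iris
import iris.analysis

# loading only reads metadata; the data payload stays lazy (a dask array)
cube = iris.load_cube("input.nc")  # placeholder file name
print(cube.has_lazy_data())  # True

# preprocessor-style operations keep the data lazy
cube = cube.collapsed("time", iris.analysis.MEAN)
print(cube.has_lazy_data())  # still True

# saving realizes the lazy data, so the input file is actually read here
iris.save(cube, "output.nc")
```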
niet, comrade, it is save:
|
I turned on intermediary file saving |
I confirm that that stacktrace doesn't happen, well the thing goes through, when reducing |
Which model(s) did you try? |
or when running with IPSL-CM4 |
I ran MIPIM-ECHAM5 (problem with multiproc save), IPSL-CM4 (goes fine) and BCM2 (same problem as ECHAM5) |
what puzzles me is why this works:
and the saver inside the tool doesn't. @bouweandela any ideas? |
afraid upgrading to |
Huzzah!! I have managed to replicate the very same hang and trace (via KeyboardInterrupt) for a toy model and not for the bulky esmvalcore framework:
|
OK, even simpler, and with any sort of netCDF file (not necessarily CMIP3):
the Trace is coming from the lazy data computation in the same manner as for that CMIP3 file
It looks like one worker is tripping on the other worker that thinks it hasn't finished yet. It is interesting to see this happen only if I call |
and yes @bouweandela was (partially) correct - it's at
will hang in the same way, Ctrl-C-ing it
The specific setup for the problem is that the same function that needs to realize lazy data is called once, then it's called again with
|
so for some odd reason, in the case of those CMIP3 models, the workflow is somehow calling a function that needs to realize the data before calling the same function in a |
There is one thing: CMIP3 models do not have dates in their name by default, so we read the file at the very beginning to know if the requested data is available or not. The thing that puzzles me is that this is done in a completely different cube |
and this particular scenario in which a function is run on the cube while |
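A rough sketch, with hypothetical names (this is not the actual _data_finder code), of the read-the-dates-from-the-file fallback being described:

```python
import iris


def get_start_end_year_from_file(filename):
    """Fallback for dateless filenames: read the time coordinate instead."""
    cube = iris.load_cube(filename)
    time = cube.coord("time")
    start = time.units.num2date(time.points[0]).year
    end = time.units.num2date(time.points[-1]).year
    return start, end
```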
Great detective work @valeriupredoi! Did you report the issue #430 (comment) with dask? |
no, should I report it to the dask folks, what do you reckon? |
Yes, I guess so? If they tell us that's not how we're supposed to use dask we'll have learned something too ;-) I'm planning to spend some time, probably in May, to see if we can parallelize esmvaltool runs better with dask. |
Actually, I'm surprised no-one ran into this issue when reading the vertical levels to regrid to from another dataset, because the code that does that looks quite similar. |
I just found this issue as I encountered exactly the same problem on a copy of some CMIP6 data where the dates were removed from the file names (please don't do that!!!). Is there already some sort of solution for this problem? Otherwise maybe we should add a warning to that function where the cube is originally loaded to read the dates from the cubes? |
No, I think the way forward is to report the problem mentioned by @valeriupredoi in #430 (comment) with dask (or check if someone else already reported it) and see if they're willing to fix it. If not, we need to
|
OK I came back to this prepandemic issue, things have not changed - I have polished a bit the toy example:

```python
import dask.array as da
from dask import compute
from dask.distributed import Client
from multiprocessing import Pool


def my_func(x1):
    # x1 is unused; it only gives the pool/client something to map over
    a = da.arange(4)
    print(a)
    s = compute(a)  # realize the lazy array


def main():
    # call function like a peasant
    my_func(2)

    # call function like an mpi pro
    pool = Pool(processes=2)
    x = pool.map(my_func, [2])

    # call function like a dask distributed pro
    client = Client()
    x = client.map(my_func, [2])
    x = client.gather(x)


if __name__ == '__main__':
    main()
```

and it still shows the same behavior - note that if you use just the functional call and the dask cluster (comment out the mpi bit) all goes fine (albeit slowly, well, it's expected). I am going to post this to Dask GH now 👍 |
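For comparison, a sketch of the variant described above as working, i.e. with the Pool bit commented out and only the plain call plus the dask distributed client left in:

```python
import dask.array as da
from dask import compute
from dask.distributed import Client


def my_func(x1):
    a = da.arange(4)
    print(a)
    return compute(a)


def main():
    # plain functional call
    my_func(2)
    # dask distributed only, no multiprocessing.Pool in between
    client = Client()
    futures = client.map(my_func, [2])
    print(client.gather(futures))


if __name__ == '__main__':
    main()
```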
Ok guys, we have us an issue dask/dask#8218 |
It's good to have the issue and gain some experience with basic dask concepts. Please note that nothing here has anything to do with MPI. Some comments on the example program:

```python
import dask.array as da
from dask import compute
from dask.distributed import Client
from multiprocessing import Pool


def my_func(x1):
    a = da.arange(4)
    print(a)
    s = compute(a)
```

This

```python
def main():
    # call function like a peasant
    my_func(2)

    # call function like an ~mpi~ pro
    pool = Pool(processes=2)
    x = pool.map(my_func, [2])
```

What is

```python
    # call function like a dask distributed pro
    client = Client()
    x = client.map(my_func, [2])
    x = client.gather(x)
```

What is x supposed to be here? This

```python
if __name__ == '__main__':
    main()
```

It's unclear what the expected behavior of the program is. That makes it very hard to debug. |
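As a side note on the question above: on its own (no Pool involved), client.map returns a list of Future objects and client.gather resolves them into concrete results, e.g.:

```python
from dask.distributed import Client


def add_one(x):
    return x + 1


if __name__ == '__main__':
    client = Client()                  # local cluster
    futures = client.map(add_one, [1, 2, 3])
    print(futures)                     # a list of Future objects
    print(client.gather(futures))      # [2, 3, 4]
    client.close()
```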
@zklaus I reckon we should comment on the Dask issue to get those guys to understand the problem at hand. Cheers for your detailed view of the code - but the problem here is (as I told the Dask guys too) that we shouldn't debug the toy model: that toy code is what I came up with to describe, in minimal code, what (I think, see my analysis process above) the chain of events is in our stack, and the reason why the analysis of CMIP3 data hangs. Please have a look at the whole thread here and see if you can come up with a better scaled-down/toy model of the stack, so we can decide if it's indeed an issue with the way we use dask or an intrinsic issue with Dask itself 🍺 |
@valeriupredoi I agree with @zklaus that you may need to clarify a few things before the dask developers can do anything with your issue. From the comments it attracted, I can see that it is not quite clear enough now.

Note that the trivially simple solution to most of the problems described in this issue would be to not fall back to reading the start and end year from the file if it cannot be obtained from the filename, i.e. remove this code:

ESMValCore/esmvalcore/_data_finder.py Lines 89 to 102 in 090962b

and return the complete list of input files, without filtering on time if it is not possible to get the start and end time from the filename, here:

ESMValCore/esmvalcore/_data_finder.py Lines 111 to 121 in 090962b
|
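A very rough sketch of what that proposal amounts to (hypothetical, simplified functions, not the actual _data_finder code): only filter by time range when the years can be parsed from the filename, otherwise keep the file:

```python
import re


def get_start_end_year(filename):
    """Very simplified date parsing; return None if the name has no dates."""
    match = re.search(r'(\d{4})\d*[-_](\d{4})\d*', filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def select_files(filenames, start_year, end_year):
    """Filter on time only when dates can be read from the filename."""
    selected = []
    for filename in filenames:
        years = get_start_end_year(filename)
        if years is None:
            # dateless filename (e.g. CMIP3): keep the file rather than
            # opening it here; the time range is clipped later anyway
            selected.append(filename)
            continue
        file_start, file_end = years
        if file_start <= end_year and file_end >= start_year:
            selected.append(filename)
    return selected
```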
I like the idea of just returning everything. Things being lazy, the impact should be rather small, particularly considering that this happens when the time span was not deemed important enough to be included in the filename. |
I like that solution too! Let's plug it in and test, then I can close the issue at Dask. The problem is very simple in my view, but they haven't actually understood what's going on - in fairness, the setup of the use case really doesn't make sense to anyone with a sane engineering mind anyway, but maybe it's just a red flag for us not to use dask's compute when we're running the same bit of code in two places, one of them sent to mpi - anyway, I think we've learned enough from this issue, cheers guys! 🍺 |
Yes, let's try the solution suggested by Bouwe. Let's also note that MPI is nowhere near any of this, so let's not keep bringing it up. If there is a problem in dask, it would be good to get to the bottom of it. But we are not using |
As discussed in #269, the processing of CMIP3 datasets works only if `max_parallel_tasks` is set to 1 in `config-user.yml`. This could be a serious limitation, e.g. for the IPCC recipe developed here.
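For reference, the workaround amounts to a single setting; a minimal config-user.yml excerpt (YAML):

```yaml
# run preprocessing tasks sequentially as a workaround for this issue
max_parallel_tasks: 1
```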