DAG optimization in recent coffea slows down workflow performance #1139
For each of the examples I conducted 3+ experiments so that the results are reliable and can be reproduced.
Maybe also worth pointing out, this behavior seemingly started with Coffea 2024.4.1: while testing this application with different versions of Coffea and keeping the other packages in the environment the same, the longer-running large tasks first appeared when upgrading from 2024.4.0 to 2024.4.1, with no other changes made to the environment (so awkward, dask_awkward, dask, and so on did not change).
Thank you for this study, it is incredibly useful to dig into these sorts of things. There are many moving parts in generating task graphs and it's hard to monitor all of them. Finding tasks like this which are sensitive to changes is super useful! We can make integration tests out of it. Before I make comments, a question: did you use the same version of dask (and dask-distributed) in all of these test cases? Finally, one thing that happened recently is that "data touching" in awkward array had some regressions in the most recent versions, which could cause some workflows to load much more data than they need to (from awkward 2.6.5). Please check
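A minimal sketch for cross-checking which versions are actually importable in the running environment (the package names below are the usual distribution names and may need adjusting for a given setup):

```python
# Sketch: print the versions that are actually importable in this environment,
# to cross-check against the conda/pip package lists.
from importlib.metadata import PackageNotFoundError, version

for pkg in ["coffea", "awkward", "dask", "dask-awkward", "distributed", "uproot"]:
    try:
        print(f"{pkg:>14}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg:>14}: not installed")
```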
@JinZhou5042 are you able to dump the conda package solution for both cases? That would be helpful to have in general.
Besides, below are the package lists generated for each environment:
packages_coffea2024.2.2.txt
And these are the dask parts:
coffea2024.2.2
coffea2024.4.0
coffea2024.6.1
I didn't find
@cmoore24-24 I don't know much about the
You might also be interested in the total data transferred in each of the cases: there is only 1 worker, the x axis represents time, and the y axis represents the increase in disk usage in MB. As seen, the latter two transfer more data than the first one, but they are pretty much the same as each other.
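For anyone wanting to reproduce this kind of disk-growth curve, a rough sketch (the scratch path and sampling interval here are placeholders, not the actual worker configuration) is to poll the size of the worker's sandbox directory over time:

```python
# Sketch: poll the total size of a worker's scratch directory and log (elapsed_s, MB).
# The path and interval are placeholders; point this at the actual worker sandbox.
import os
import time

def dir_size_mb(path):
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file may disappear between listing and stat
    return total / 1e6

start = time.time()
while True:
    print(f"{time.time() - start:.1f}s  {dir_size_mb('/tmp/worker-scratch'):.1f} MB")
    time.sleep(5)
```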
Interesting, in this most recent version it's picked up a synchronization point. Can you also dump for me the versions of
coffea2024.2.2
coffea2024.4.0
coffea2024.6.1
Hi @JinZhou5042 and @lgray, I've done a couple of checks on the calculations being done in test_skimmer.py, and
And, indeed, the versions from conda are all the same, so you're consistently using the one with the data-touching bug. :-) So it all comes down to coffea... That's interesting! I'll look into what's going on between versions there. I know in a few versions there wasn't much change in how coffea was assembling its calls to dask, but since you're moving month to month there may be some drift. I'll take a look at the PRs that went in. If you feel like
I'm spending the day testing that application with many version combinations of coffea and dask; the issue might lie on the coffea side, but I'm not sure yet. I will update with some results later...
Here's the table showing the different versions of coffea, dask, and uproot, along with the corresponding number of submitted tasks and the total execution time.
I previously confused the concepts of restructuring the graph and optimizing the graph. It appears that in one of the
My next step would be trying
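To tell graph restructuring (how coffea/dask-awkward build the graph) apart from graph optimization (what dask applies before execution), one quick check is to count tasks in the raw versus optimized graph of the same collection. The toy `dask.array` reduction below stands in for whatever collection the skimmer actually produces:

```python
# Sketch: compare the task count of a graph as built with the count after dask's
# optimization pass. `result` here is a toy dask.array reduction standing in for
# the collection the analysis actually produces.
import dask
import dask.array as da

result = da.ones((1000, 1000), chunks=(100, 100)).sum()

raw = dict(result.__dask_graph__())          # graph exactly as constructed
print("tasks before optimization:", len(raw))

(optimized,) = dask.optimize(result)         # apply dask's graph optimizations
print("tasks after optimization: ", len(dict(optimized.__dask_graph__())))
```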
Would it be easy to run the same analysis task using a different data source (e.g. xrootd from FNAL EOS or something else)?
Hi @nsmith- I've put together a reproducer that has the moving parts of the analysis, so it should be a faithful abridged version of the analysis that Jin has been posting results for. The reproducer includes the postprocessing step, which has the same step size as Jin's. The benchmark here is the Hbb sample, found in the lpcpfnano EOS, here: I hope this is what you had in mind, let me know if you run into issues.
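For anyone who wants to point the reproducer at a remote copy instead of local files, a minimal sketch of reading over xrootd with uproot (the redirector and file path below are placeholders, not the actual lpcpfnano location, and an XRootD-capable uproot installation is assumed):

```python
# Sketch: open a NanoAOD file over xrootd instead of from local disk.
# The redirector and path are placeholders; substitute the actual sample location.
import uproot

events = uproot.open(
    "root://cmseos.fnal.gov//store/user/placeholder/nano_sample.root"
)["Events"]
print(events.num_entries)
```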
Yesterday, @cmoore24-24 and I encountered noticeable performance discrepancies while running the same physics application (test_skimmer.py) with TaskVine. After some elimination, we believe this stems from recent updates to Coffea, specifically changes to DAG generation that increased the number of tasks but also introduced some long-running outliers. These outliers are the main reason for the prolonged overall runtime of the application.
This observation was drawn from three representative tests; aside from the Coffea version, all other environment settings are identical.
There are three versions of Coffea in our experiments:
Coffea 2024.2.2
This version of Coffea generates 163 tasks in total. Below is the task graph, where the label on:
This figure shows how the tasks are distributed; the total execution time is 666.4s.
The CDF of task execution time shows that 90.18% of tasks finish within 91s and the longest one takes 394.09s.
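For reference, the CDF-style numbers quoted here and in the later sections can be computed from the per-task wall times with a few lines of numpy; the durations below are placeholder values, not measurements from the run:

```python
# Sketch: compute the task-duration CDF and the headline numbers quoted above.
# `durations_seconds` is a placeholder for the per-task wall times from a run.
import numpy as np

durations_seconds = [12.3, 45.6, 80.2, 91.0, 394.1]  # placeholder values
durations = np.sort(np.asarray(durations_seconds))
cdf = np.arange(1, len(durations) + 1) / len(durations)  # what the CDF figure plots

p90 = np.percentile(durations, 90)
print(f"90% of tasks finish within {p90:.2f}s; longest task: {durations[-1]:.2f}s")
print("CDF points:", list(zip(durations.tolist(), cdf.tolist())))
```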
Looking into the csv file, there are 7 categories, each consisting of a number of files.
category_info_coffea2024.2.2.csv
Coffea 2024.4.0
Moving to a later version of Coffea, Connor suggested that this version has the magic combination of task graph changes (so we see 222 tasks here): presumably the small tasks are completing more quickly, but the runtime is still about 10 minutes, meaning the long tasks aren't suffering much. In other words, there's a balance there that was changed along the way.
The task graph suggests that the structure has changed drastically, and we suspect a DAG optimization is involved:
And the execution details: 652.1s in total, pretty much the same as with the first version:
90.09% of tasks finish within 65.42s and the longest one takes 395.26s. Notably, most tasks run faster than in the previous version, but due to the increased number of tasks and no performance gain for the longest tasks, the total duration is essentially unchanged.
We also see an increased number of categories with this version (10 vs. 7).
category_info_coffea2024.4.0.csv
Coffea 2024.6.1
This version seems to have a further, somewhat mysterious optimization of the DAG: it has the same number of tasks (222) as Coffea 2024.4.0, but the longest task runs much slower (554.99s vs. 395.26s), which results in a longer overall runtime.
The task graph:
Execution details (200s longer):
Execution CDF (90.54% of tasks finish within 44.34s):
The number of categories (10) is the same as with Coffea 2024.4.0.
category_info_coffea2024.6.1.csv
From the comparisons, it's clear that Coffea 2024.4.0 optimized the DAG by introducing a wider variety of tasks and dividing the graph into finer granularity, which increased the overall concurrency and reduced the runtime of most tasks.
Coffea 2024.6.1 built on this by further shortening the runtime of the majority of tasks. However, it also lengthened the runtime of a few long tasks, which extended the critical path of the graph, thus not only failing to improve the overall application performance but actually degrading it by about 20%.
In our experience, optimizing long tasks is crucial for reducing the DAG's completion time. This includes strategies such as scheduling them in advance using existing information, replicating them to minimize recovery delays, or gathering useful data about long tasks during the graph-building phase.
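As a concrete illustration of scheduling long tasks earlier, dask lets parts of a graph carry scheduling hints via annotations; whether a given executor (e.g. TaskVine) honors them is a separate question, so this is a sketch of the idea rather than a proposed fix. The toy delayed functions below are hypothetical stand-ins for the expensive and cheap parts of the skimmer:

```python
# Sketch: attach a higher scheduling priority to the branch of the graph that is
# known to be long-running, so a priority-aware scheduler can start it early.
# Note: priority annotations are only acted on by priority-aware schedulers
# (e.g. dask.distributed); other executors may ignore them.
import time
import dask
from dask import delayed

@delayed
def heavy_task(x):
    time.sleep(5)      # stands in for a long-running selection
    return x

@delayed
def light_task(x):
    return x + 1

with dask.annotate(priority=10):   # hint: start the long branch as early as possible
    heavy = heavy_task(1)

with dask.annotate(priority=0):    # cheap branches keep default priority
    light = [light_task(i) for i in range(10)]

print(dask.compute(heavy, light))
```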
We believe Coffea's recent DAG changes likely lead to significant performance enhancements in many scenarios involving graph operations. However, we also want to identify under what circumstances they might degrade performance, and we would appreciate collaboration to resolve this issue.
tagging @dthain because this came up in a discussion with him