⚠️ CI failed ⚠️ #271
Comments
The two regressions detected are:
@ian-r-rose Do you think this latest one is related to dask/dask#9397 (comment)?
Interesting, it doesn't look to me like the timing lines up (that fix was merged on Aug 19). I'm not sure, but it's worth investigating.
Timing-wise, all dask/dask changes that might be relevant: git log --since='2022-08-15 14:15' --until='2022-08-18 14:15' --pretty=oneline
All dask/distributed changes that may be relevant:
Nothing suspicious jumps out. Of course, this might also be a coiled-related change. cc @ntabris @shughes-uk is there a way for us to see when there were coiled deployments?
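For anyone repeating the commit-window check above, here is a minimal Python sketch that builds the same git log invocation for any local clone. The helper names `git_log_args` and `commits_in_window` and the `repo` argument are hypothetical; only the git flags come from the command quoted above.

```python
import subprocess


def git_log_args(since: str, until: str) -> list[str]:
    """Build the `git log` command for commits inside a time window."""
    return [
        "git", "log",
        f"--since={since}",
        f"--until={until}",
        "--pretty=oneline",
    ]


def commits_in_window(repo: str, since: str, until: str) -> list[str]:
    """Run the command inside `repo` (assumed to be a local clone).

    Returns one "<sha> <subject>" line per commit.
    """
    result = subprocess.run(
        git_log_args(since, until),
        cwd=repo,
        capture_output=True,
        text=True,
        check=True,  # raise if git exits with an error
    )
    return result.stdout.splitlines()
```

Running `commits_in_window` against both a dask/dask and a dask/distributed checkout with the same window would give the two lists side by side.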
Do you have access to Datadog? If so, you can look at the APM services and check the deployments; here's a link that shows the month's deployments. This will show the last deployment, and the version contains the timestamp:
The only recent platform change that I'd expect to possibly impact perf was setting the cgroup for memory on the container. This means the machine won't freeze from dask using too much memory, but it also means that dask potentially won't use as much memory (e.g., it will restart the worker before hitting the ceiling that it was hitting before). This change went to prod on the evening of August 16.
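To confirm which memory ceiling a container actually sees, a rough sketch like the following can be run inside it. It assumes the standard Linux cgroup mount points (`/sys/fs/cgroup/memory.max` for cgroup v2, `/sys/fs/cgroup/memory/memory.limit_in_bytes` for v1); the function names are made up for illustration and this is not how the platform or dask reads the limit.

```python
from pathlib import Path
from typing import Optional

# Standard locations; a given container may mount cgroups elsewhere (assumption).
CGROUP_V2 = Path("/sys/fs/cgroup/memory.max")
CGROUP_V1 = Path("/sys/fs/cgroup/memory/memory.limit_in_bytes")


def parse_limit(raw: str) -> Optional[int]:
    """Parse the file contents; cgroup v2 writes "max" when no limit is set."""
    raw = raw.strip()
    if raw == "max":
        return None
    return int(raw)


def container_memory_limit() -> Optional[int]:
    """Return the memory limit in bytes, or None if no limit file is found.

    Note: cgroup v1 reports "unlimited" as a very large integer rather
    than "max", so callers may want to compare against physical memory.
    """
    for path in (CGROUP_V2, CGROUP_V1):
        if path.exists():
            return parse_limit(path.read_text())
    return None
```

Comparing this value before and after the platform change would show whether the workers' effective ceiling moved.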
I guess this is it then. This change not only protects from OOM but also changes a few internal memory monitoring mechanisms. For instance, if these workloads are operating at high memory pressure, chances are that we're now spilling more data, making the entire thing a bit slower.
Edit: I ran the workload myself and strongly doubt that this is related to the changes to cgroups/memory limits. This workload is far away from any limiting/spilling. This is something else. If I drive the workload myself, it also finishes about 25% faster than if it's running in the benchmark suite. We've seen some of these systematic problems in other tests as well, where we suspect that the scheduler is accumulating some state that slows everything down in turn, e.g. #253.
Now I'm wondering what kind of additional tests we started to run during that time period. Indeed, we started to run many more workloads during that time:
82d6b21 (main) Integration tests for spilling (#229)
cc @gjoseph92
Didn't have a chance to test this, but just skimming our code base, this change could help: dask/distributed#6944
Notice that the regression is in the durations but not as much in the memory. @ian-r-rose and I think that since this query is a
@ncclementi you might be interested in reviewing/testing #269 in this context :)
After doing a git bisect, it looks like the regression comes from the shuffle PR; see dask/dask#9428 for reference. I pinged Rick, since he wrote the PR, to take a look at it and give some feedback.
Closing this as the regression was exposed after fixing a bug.
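For readers unfamiliar with what the bisect does, here is a minimal sketch of the search it performs, written as plain Python over a list of commit identifiers. The function name `first_bad_commit` and the `is_bad` callback (for example, "check out this commit, run the benchmark, compare against a threshold") are hypothetical; this illustrates the idea and is not the procedure actually used in this issue.

```python
from typing import Callable, Sequence


def first_bad_commit(
    commits: Sequence[str], is_bad: Callable[[str], bool]
) -> str:
    """Binary-search for the first regressing commit.

    Assumes `commits` is ordered oldest to newest, commits[0] is known
    good, commits[-1] is known bad, and the regression persists once
    it has been introduced.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi]
```

Each call to `is_bad` stands in for one benchmark run, so a window of N commits needs roughly log2(N) runs rather than N.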
Workflow Run URL