Remove EnsureCommunicatingAfterTransitions #6462
Conversation
[EDIT] this comment refers to a previous version of the PR. The condition described below is still there, but only triggered by multiple events in short sequence.

This PR introduces an O(n² log n) condition. This should be negligible most of the time; I'll write another PR later on to make running _ensure_communicating twice in a row truly negligible (blocked by #6388). To clarify: this condition already exists when there are many compute-task and acquire-replicas requests, which cause the data_needed queue to expand rapidly. This PR specifically extends it to tasks fetched within the same event.
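For intuition, here is a minimal sketch, assuming a heap-backed data_needed queue and a simplified cost model (this is not the actual worker code), of how triggering _ensure_communicating once per task rather than once per event yields the O(n² log n) behaviour:

```python
import heapq

def ensure_communicating(data_needed: list) -> None:
    # Stand-in for the real method: inspect every queued fetch by
    # popping and re-pushing it -- O(k log k) for k queued tasks.
    popped = [heapq.heappop(data_needed) for _ in range(len(data_needed))]
    for item in popped:
        heapq.heappush(data_needed, item)

data_needed: list = []
for priority in range(1000):  # n tasks arriving within the same event
    heapq.heappush(data_needed, priority)
    # Running once per task instead of once per event means n runs
    # over a queue of up to n items: O(n^2 log n) overall.
    ensure_communicating(data_needed)
```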
Unit Test Results

15 files +12   15 suites +12   6h 40m 57s ⏱️ +5h 53m 11s

For more details on these failures, see this check.

Results for commit e1058fe. ± Comparison against base commit 69b798d.

♻️ This comment has been updated with latest results.
[EDIT] this comment refers to a previous version of the PR and is now obsolete.

The transition log has changed from

to

This... is correct, but it's very counter-intuitive. Again, all this is correct, but it's confusing; it took me an unhealthy amount of time to figure out why the
As I'm arguing in #6442 (comment), I believe this log should simply be removed.
I could reproduce the issue with the test's assumptions about which keys are to be fetched. This is not entirely obvious from the tests, and I consider it a bit concerning. It's also not about any priorities, ordering, etc., but rather that we're using an unordered set (distributed/worker.py, lines 2125 to 2129 at 6d85a85).

This randomness was previously buffered by the delayed _ensure_communicating. I don't feel great about introducing randomness into our scheduling (e.g. there is not even a key-based tie breaker). This entire refactoring effort is supposed to help us make things more deterministic.
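As a generic Python illustration (not the worker code) of why an unordered set makes the choice of the next key nondeterministic, and how a key-based tie breaker would fix it:

```python
# Set iteration follows hash order, not insertion or key order; for
# strings that order varies across processes (PYTHONHASHSEED), so the
# key picked "first" is effectively random between runs.
keys = {"x-1", "x-2", "x-3"}
first_fetched = next(iter(keys))  # nondeterministic across runs

# A key-based tie breaker restores determinism:
first_fetched = min(keys)  # always 'x-1'
```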
This has been parked until a new design is agreed upon in #6497.
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0   15 suites ±0   10h 10m 51s ⏱️ +27s

For more details on these failures, see this check.

Results for commit 9051170. ± Comparison against base commit 88e1fe0.

♻️ This comment has been updated with latest results.
The PR has been rewritten from scratch and is now ready for review and merge.
This is great!
    return merge_recs_instructions(
        (recommendations, []),
        self._ensure_communicating(stimulus_id=ev.stimulus_id),
    )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love that this is not everywhere anymore ❤️
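For context, a sketch of what merge_recs_instructions presumably does (types simplified; this is an assumption based on how it is called above, not the actual implementation in distributed.worker_state_machine), which is why this chaining had to be repeated in every event handler before this PR:

```python
# Recs maps a task to its recommended next state; Instructions is a
# list of instruction objects. The helper merges any number of
# (recs, instructions) pairs into one pair.
def merge_recs_instructions(*pairs):
    recs, instructions = {}, []
    for r, i in pairs:
        recs.update(r)      # later recommendations override earlier ones
        instructions.extend(i)
    return recs, instructions
```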
Partially closes #6497

#6165 introduced this hack for the sake of being functionally identical to the previous code. This is a cleaner redesign, which is conceptually the same.
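A rough sketch of the redesign's shape (handle_stimulus and _handle_event are hypothetical simplifications; only _ensure_communicating appears in the quoted diff): rather than every event handler merging in _ensure_communicating, the state machine's single entry point runs it once per event:

```python
class WorkerStateMachine:
    # Hypothetical simplification; the real class is far richer.
    def handle_stimulus(self, event) -> list:
        recs, instructions = self._handle_event(event)
        # _ensure_communicating returns its own (recs, instructions)
        # pair; it is merged exactly once here instead of inside
        # every event handler.
        more_recs, more_instructions = self._ensure_communicating(
            stimulus_id=event.stimulus_id
        )
        recs.update(more_recs)
        self._transitions(recs, stimulus_id=event.stimulus_id)
        return instructions + more_instructions
```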