Deduplicate data_needed #6587
Conversation
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
15 files ±0   15 suites ±0   10h 6m 23s ⏱️ +28m 43s
For more details on these failures, see this check.
Results for commit c8267eb. ± Comparison against base commit 4b24753.
♻️ This comment has been updated with latest results.
Force-pushed from d5324b7 to b1529f4 (Compare)
Note
----
Instead of number of tasks, we could've measured total nbytes and/or number of
tasks that only exist on the worker. Raw number of tasks is cruder but simpler.
At this point, the previous algorithm was already random.
Well, the top priority task was never random but defined by the heap. Even same priority tasks would be deterministic given deterministic heap internals.
Even the _select_keys_for_gather / data_needed_per_worker was not random but rather insertion-ordered.
Do we have an option to not use random? How about a str-compare of tasks.peek() to make it deterministic?
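As an illustration, a deterministic tie-breaker could look like the following minimal sketch (TS, Tasks, and build_heap are hypothetical stand-ins for TaskState, HeapSet, and the PR's heap construction, not the actual code):

import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class TS:
    # Stand-in for TaskState: only priority participates in comparisons
    priority: int
    key: str = field(compare=False)

class Tasks(list):
    # Stand-in for HeapSet: peek() returns the highest-priority task
    def peek(self) -> TS:
        return min(self)

def build_heap(data_needed: dict[str, Tasks], host: str) -> list:
    heap: list = []
    for worker, tasks in data_needed.items():
        is_remote = not worker.startswith(host)  # crude stand-in for get_address_host
        heapq.heappush(
            heap,
            # The worker address replaces random.random() as the tie-breaker:
            # addresses are unique, so the ordering is total and deterministic.
            (is_remote, tasks.peek().priority, -len(tasks), worker, tasks),
        )
    return heap

Since no two entries share the same worker address, the trailing Tasks object is never compared.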
distributed/worker_state_machine.py
Outdated
Yield the peer workers and tasks in data_needed, sorted by:

1. first local, then remote
2. if tied, by highest-priority task available
I could trivially swap these two, reverting it to fetch higher priority tasks first as in main. As explained in the opening comment, this would cause up to 50MB worth of lower-priority tasks to be fetched from a remote host even if they're available on a local one.
I would indeed swap these and always prefer prioritized data
- The previous behavior always chose priority (next in data_needed) and fetched this locally if there was a local worker available. If priority is listed first in this heap, we'd have the same behavior. I would suggest engaging in behavior changes here only if we can back them up with benchmarks.
- By preferring local workers I can see an edge case where many tasks / many GBs of data are on a couple of local workers, but a couple of small high-priority tasks are on a remote worker. These small high-priority tasks would be needed to unblock an important downstream task (hence the high priority). We'd fetch a lot of data and risk even overflowing the worker until all local data is replicated, before we even tried to fetch the remote data.
Force-pushed from 46065ff to 39d789e (Compare)
Force-pushed from c31b7b6 to 4ecc014 (Compare)
This PR is no longer blocked by #6593
- I'm slightly worried about runtimes for very large clusters (think ~1k workers upwards). Maybe this is not a big deal on the workers but I could see this amounting to a few orders of magnitude slower in the average case for large clusters
- I think we should always prefer priority over anything else. Every other change should be verified with benchmarking
- We had a similar argument when testing the Scheduler.rebalance algorithm in AMM but I'm again wondering if we should write dedicated unit tests for this, e.g. define these things as functions and test them in isolation
Testing could look like:

from collections import defaultdict
from collections.abc import Iterator
from operator import attrgetter

from distributed.collections import HeapSet
from distributed.worker_state_machine import TaskState

DATA_NEEDED = dict[str, HeapSet[TaskState]]

def _select_workers_for_gather(
    host: str,
    data_needed: DATA_NEEDED,
    skip_workers: set[str],  # union of in_flight_workers and busy_workers
) -> Iterator[tuple[str, HeapSet[TaskState]]]:
    ...

def _select_keys_for_gather(
    data_needed: DATA_NEEDED,
    available: HeapSet[TaskState],
) -> tuple[set[TaskState], int]:
    ...

def _get_data_to_fetch(
    host: str, data_needed: DATA_NEEDED, skip_workers: set[str]
) -> Iterator[tuple[str, set[TaskState], int]]:
    for worker, available in _select_workers_for_gather(
        host=host,
        data_needed=data_needed,
        skip_workers=skip_workers,
    ):
        yield (worker, *_select_keys_for_gather(data_needed=data_needed, available=available))

def test_prefer_priority():
    data_needed = defaultdict(lambda: HeapSet(key=attrgetter("priority")))
    host = "127.0.0.1"
    local_worker = f"{host}:1234"
    remote_worker = "10.5.6.7:5678"
    # Lower value = higher priority; the high-priority task sits on the
    # remote worker, so preferring priority must fetch remotely first.
    ts1 = TaskState(
        key="key",
        priority=0,
        nbytes=1,
        who_has={remote_worker},
    )
    data_needed[remote_worker].push(ts1)
    ts2 = TaskState(
        key="key2",
        priority=1,
        nbytes=1,
        who_has={local_worker},
    )
    data_needed[local_worker].push(ts2)
    assert list(_get_data_to_fetch(host, data_needed, set())) == [
        (remote_worker, {ts1}, 1),
        (local_worker, {ts2}, 1),
    ]
This would at least provide us some lever to test a few of the edge cases encountered in this implementation, for instance the mutability of data_needed and the interplay between _select_keys_for_gather and _select_workers_for_gather.
distributed/worker_state_machine.py
Outdated
get_address_host(worker) != host,  # False < True
tasks.peek().priority,
-len(tasks),
random.random(),
worker,
tasks,
I do like that we now have a very simple lever to control the fetching behavior +1
I would suggest not to engage in any large conversations here. I think we could theorycraft this for a while but eventually I'd be interested in real world benchmarks. We might want to revisit this once we have a couple of test workloads ready to go.
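To make that lever concrete, here is a hypothetical sketch of the two candidate sort keys being debated (get_address_host is replaced by a trivial stand-in so the snippet is self-contained; neither function is the PR's actual code):

def get_address_host(addr: str) -> str:
    # Stand-in for distributed.comm.addressing.get_address_host
    return addr.rsplit(":", 1)[0]

def key_locality_first(host: str, worker: str, tasks) -> tuple:
    # As in the quoted diff: local workers win, task priority breaks ties
    return (get_address_host(worker) != host, tasks.peek().priority, -len(tasks))

def key_priority_first(host: str, worker: str, tasks) -> tuple:
    # As suggested in review: task priority wins, locality breaks ties
    return (tasks.peek().priority, get_address_host(worker) != host, -len(tasks))

Swapping the first two tuple elements is the entire behavioral change.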
for worker, tasks in list(self.data_needed.items()):
    if not tasks:
        del self.data_needed[worker]
        continue
    if worker in self.in_flight_workers or worker in self.busy_workers:
        continue
I know I initially suggested to go down this road, but I would still like to raise the point that this implementation now iterates over all workers every time we call _ensure_communicating.
Even worse, if I'm not mistaken, the below iteration is not even linear (even without the additional push).
The iteration is O(n) in the number of peer workers that hold any data to fetch, and O(n log n) in the number of workers that hold data to fetch and are neither in flight nor busy - split between O(n) in Python bytecode times O(log n) in C.
In a worst-case scenario of a cluster with 1000 workers where you suddenly have data to fetch from all 1000 (and I seriously doubt this is realistic), you'd perform:

1. a for loop in pure Python of 1000 iterations
2. a single heapq.heapify on a list of 1000 elements, which costs O(n) and is implemented in C
3. a for loop in pure Python of 50 iterations (distributed.worker.connections.outgoing), each of which calls heapq.heappop on a list of 950~1000 elements, which costs O(log n) and is implemented in C
At the next call to _ensure_communicating, before any worker has responded, you'll just repeat step 1, while skipping steps 2 and 3.
I strongly suspect this whole thing to be negligible in most cases.
I can easily improve this by moving aside the busy/in-flight workers to a separate attribute (e.g. WorkerState.busy_workers), thus making the second call O(1). However, if you agree, I'd rather do it in a follow-up PR.
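For a rough sense of scale, a hypothetical micro-benchmark of the worst case described above (the numbers mirror the 1000-worker scenario; nothing here is from the PR):

import heapq
import random
import timeit

def worst_case() -> None:
    # 1000 peer workers suddenly all hold data to fetch
    entries = [(random.random(), i) for i in range(1000)]
    heapq.heapify(entries)  # O(n), implemented in C
    # 50 outgoing connections, as in distributed.worker.connections.outgoing
    for _ in range(50):
        heapq.heappop(entries)  # O(log n) each, implemented in C

# Expect well under a millisecond per call on typical hardware
print(timeit.timeit(worst_case, number=1000) / 1000, "s per call")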
distributed/worker_state_machine.py
Outdated
heapq.heappush(
    heap,
    (is_remote, tasks.peek().priority, -len(tasks), rnd, worker, tasks),
)
This is tying a knot in my head when thinking about runtime.
heappop/heappush are O(log n) and are implemented in C.
I'm not talking about the heappush itself, but rather that it pushes onto the heap we're currently iterating over.
Yes, initially there was no heap, just a sort(). But then I realised that, if you have tasks with multiple replicas, your order after the first worker may change, so you need to re-sort. A heap is the most efficient way to do it.
OK, I'm swapping priority and local/remote around.
The previous algorithm was:
That would be pointless when the same task is available in multiple replicas from different workers.
I'm adding a statically seeded random state to obtain reproducibility.
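Roughly, the pop/re-push pattern with a statically seeded random state could look like this sketch (reusing the TS/Tasks stand-ins from the earlier sketch; the locality element is omitted for brevity and iter_workers is a hypothetical name, not the PR's code):

import heapq
import random
from collections.abc import Iterator

rng = random.Random(0)  # statically seeded: runs are reproducible

def iter_workers(heap: list) -> Iterator[tuple]:
    # Pop the currently-best worker, let the caller drain some of its tasks,
    # then re-push it with a fresh sort key: fetching a task that has
    # replicas removes it from other workers' heaps too, so the old
    # ordering may be stale by the time we come back around.
    while heap:
        _, _, _, worker, tasks = heapq.heappop(heap)
        yield worker, tasks
        if tasks:  # the caller is expected to have drained tasks in between
            heapq.heappush(
                heap,
                (tasks.peek().priority, -len(tasks), rng.random(), worker, tasks),
            )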
Would it be OK if I add unit tests (1) for the determinism and (2) to test that priority is chosen over locality as a follow-up after #6446? That PR makes it easier to write state tests.
sure
Added test about gather priority.
        to_gather={"x5"},
        total_nbytes=4 * 2**20,
    ),
]
Writing this test filled me with pure, unadulterated joy ❤️
- Remove data_needed; rename data_needed_per_worker to data_needed.
- Calling _ensure_communicating a second time in short succession would previously cost O(t*log(t)), where t is the number of tasks exclusively held by workers in flight or busy. The cost is now O(w), where w is the number of workers in flight or busy.