Deduplicate data_needed #6587
Conversation
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
15 files ±0   15 suites ±0   10h 6m 23s ⏱️ +28m 43s
For more details on these failures, see this check.
Results for commit c8267eb. ± Comparison against base commit 4b24753.
♻️ This comment has been updated with latest results.
Force-pushed from d5324b7 to b1529f4 (Compare)
Note
----
Instead of number of tasks, we could've measured total nbytes and/or number of
tasks that only exist on the worker. Raw number of tasks is cruder but simpler.
At this point, the previous algorithm was already random.
Well, the top priority task was never random but defined by the heap. Even same priority tasks would be deterministic given deterministic heap internals.
Even the _select_keys_for_gather / data_needed_per_worker was not random but rather insertion-ordered.
Do we have an option to not use random? How about a str-compare of tasks.peek() to make it deterministic?
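As an illustration, a deterministic tie-breaker could look like the following minimal sketch (TS, Tasks, and build_heap are hypothetical stand-ins for TaskState, HeapSet, and the PR's heap construction, not the actual code):

import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class TS:
    # Stand-in for TaskState: only priority participates in comparisons
    priority: int
    key: str = field(compare=False)

class Tasks(list):
    # Stand-in for HeapSet: peek() returns the highest-priority task
    def peek(self) -> TS:
        return min(self)

def build_heap(data_needed: dict[str, Tasks], host: str) -> list:
    heap: list = []
    for worker, tasks in data_needed.items():
        is_remote = not worker.startswith(host)  # crude stand-in for get_address_host
        heapq.heappush(
            heap,
            # The worker address replaces random.random() as the tie-breaker:
            # addresses are unique, so the ordering is total and deterministic.
            (is_remote, tasks.peek().priority, -len(tasks), worker, tasks),
        )
    return heap

Since no two entries share the same worker address, the trailing Tasks object is never compared.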
distributed/worker_state_machine.py
Outdated
Yield the peer workers and tasks in data_needed, sorted by:

1. first local, then remote
2. if tied, by highest-priority task available
I could trivially swap these two, reverting it to fetch higher priority tasks first as in main. As explained in the opening comment, this would cause up to 50MB worth of lower-priority tasks to be fetched from a remote host even if they're available on a local one.
I would indeed swap these and always prefer prioritized data
- The previous behavior always chose priority (next in data_needed) and fetched this locally if there was a local worker available. If priority is listed first in this heap, we'd have the same behavior. I would suggest engaging in behavior changes here only if we can back them up with benchmarks.
- By preferring local workers I can see an edge case where many tasks / many GBs of data are on a couple of local workers, but a couple of small high-priority tasks are on a remote worker. These small high-priority tasks would be needed to unblock an important downstream task (hence the high priority). We'd fetch a lot of data and risk even overflowing the worker until all local data is replicated, before we even tried to fetch the remote data.
Force-pushed from 46065ff to 39d789e (Compare)
Force-pushed from c31b7b6 to 4ecc014 (Compare)
This PR is no longer blocked by #6593
- I'm slightly worried about runtimes for very large clusters (think ~1k workers upwards). Maybe this is not a big deal on the workers but I could see this amounting to a few orders of magnitude slower in the average case for large clusters
- I think we should always prefer priority over anything else. Every other change should be verified with benchmarking
- We had a similar argument when testing the Scheduler.rebalance algorithm in AMM but I'm again wondering if we should write dedicated unit tests for this, e.g. define these things as functions and test them in isolation
Testing could look like:

from collections import defaultdict
from collections.abc import Iterator
from operator import attrgetter

from distributed.collections import HeapSet
from distributed.worker_state_machine import TaskState

DATA_NEEDED = dict[str, HeapSet[TaskState]]

def _select_workers_for_gather(
    host: str,
    data_needed: DATA_NEEDED,
    skip_workers: set[str],  # union of in_flight_workers and busy_workers
) -> Iterator[tuple[str, HeapSet[TaskState]]]:
    ...

def _select_keys_for_gather(
    data_needed: DATA_NEEDED,
    available: HeapSet[TaskState],
) -> tuple[set[TaskState], int]:
    ...

def _get_data_to_fetch(
    host: str, data_needed: DATA_NEEDED, skip_workers: set[str]
) -> Iterator[tuple[str, set[TaskState], int]]:
    for worker, available in _select_workers_for_gather(
        host=host,
        data_needed=data_needed,
        skip_workers=skip_workers,
    ):
        yield (worker, *_select_keys_for_gather(data_needed=data_needed, available=available))

def test_prefer_priority():
    data_needed = defaultdict(lambda: HeapSet(key=attrgetter("priority")))
    host = "127.0.0.1"
    local_worker = f"{host}:1234"
    remote_worker = "10.5.6.7:5678"
    # Lower value = higher priority; the high-priority task sits on the
    # remote worker, so preferring priority must fetch remotely first.
    ts1 = TaskState(
        key="key",
        priority=0,
        nbytes=1,
        who_has={remote_worker},
    )
    data_needed[remote_worker].push(ts1)
    ts2 = TaskState(
        key="key2",
        priority=1,
        nbytes=1,
        who_has={local_worker},
    )
    data_needed[local_worker].push(ts2)
    assert list(_get_data_to_fetch(host, data_needed, set())) == [
        (remote_worker, {ts1}, 1),
        (local_worker, {ts2}, 1),
    ]
This would at least provide us some lever to test a few of the edge cases encountered in this implementation, for instance the mutability of data_needed and the interplay between _select_keys_for_gather and _select_workers_for_gather.
distributed/worker_state_machine.py
Outdated
get_address_host(worker) != host,  # False < True
tasks.peek().priority,
-len(tasks),
random.random(),
worker,
tasks,
I do like that we now have a very simple lever to control the fetching behavior +1
I would suggest not to engage in any large conversations here. I think we could theorycraft this for a while but eventually I'd be interested in real world benchmarks. We might want to revisit this once we have a couple of test workloads ready to go.
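To make that lever concrete, here is a hypothetical sketch of the two candidate sort keys being debated (get_address_host is replaced by a trivial stand-in so the snippet is self-contained; neither function is the PR's actual code):

def get_address_host(addr: str) -> str:
    # Stand-in for distributed.comm.addressing.get_address_host
    return addr.rsplit(":", 1)[0]

def key_locality_first(host: str, worker: str, tasks) -> tuple:
    # As in the quoted diff: local workers win, task priority breaks ties
    return (get_address_host(worker) != host, tasks.peek().priority, -len(tasks))

def key_priority_first(host: str, worker: str, tasks) -> tuple:
    # As suggested in review: task priority wins, locality breaks ties
    return (tasks.peek().priority, get_address_host(worker) != host, -len(tasks))

Swapping the first two tuple elements is the entire behavioral change.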
for worker, tasks in list(self.data_needed.items()):
    if not tasks:
        del self.data_needed[worker]
        continue
    if worker in self.in_flight_workers or worker in self.busy_workers:
        continue
I know I initially suggested to go down this road, but I would still like to raise the point that this implementation now iterates over all workers every time we call _ensure_communicating.
Even worse, if I'm not mistaken, the below iteration is not even linear (even without the additional push).
The iteration is O(n) in the number of peer workers that hold any data to fetch, and O(n log n) in the number of workers that hold data to fetch and are neither in flight nor busy - split between O(n) in Python bytecode times O(log n) in C.
In a worst-case scenario of a cluster with 1000 workers where you suddenly have data to fetch from all 1000 (and I seriously doubt this is realistic), you'd perform:

1. a for loop in pure Python of 1000 iterations
2. a single heapq.heapify on a list of 1000 elements, which costs O(n) and is implemented in C
3. a for loop in pure Python of 50 iterations (distributed.worker.connections.outgoing), each of which calls heapq.heappop on a list of 950~1000 elements, which costs O(log n) and is implemented in C
At the next call to _ensure_communicating, before any worker has responded, you'll just repeat step 1, while skipping steps 2 and 3.
I strongly suspect this whole thing to be negligible in most cases.
I can easily improve this by moving aside the busy/in-flight workers to a separate attribute (e.g. WorkerState.busy_workers), thus making the second call O(1). However, if you agree, I'd rather do it in a follow-up PR.
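For a rough sense of scale, a hypothetical micro-benchmark of the worst case described above (the numbers mirror the 1000-worker scenario; nothing here is from the PR):

import heapq
import random
import timeit

def worst_case() -> None:
    # 1000 peer workers suddenly all hold data to fetch
    entries = [(random.random(), i) for i in range(1000)]
    heapq.heapify(entries)  # O(n), implemented in C
    # 50 outgoing connections, as in distributed.worker.connections.outgoing
    for _ in range(50):
        heapq.heappop(entries)  # O(log n) each, implemented in C

# Expect well under a millisecond per call on typical hardware
print(timeit.timeit(worst_case, number=1000) / 1000, "s per call")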
distributed/worker_state_machine.py
Outdated
heapq.heappush(
    heap,
    (is_remote, tasks.peek().priority, -len(tasks), rnd, worker, tasks),
)
This is tying a knot in my head when thinking about runtime.
heappop/heappush are O(log n) and are implemented in C.
I'm not talking about the heappush itself, but rather that it pushes onto the heap we're currently iterating over.
Yes, initially there was no heap, just a sort(). But then I realised that, if you have tasks with multiple replicas, your order after the first worker may change, so you need to re-sort. A heap is the most efficient way to do it.
OK, I'm swapping priority and local/remote around.
The previous algorithm was:
That would be pointless when the same task is available in multiple replicas from different workers.
I'm adding a statically seeded random state to obtain reproducibility.
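Roughly, the pop/re-push pattern with a statically seeded random state could look like this sketch (reusing the TS/Tasks stand-ins from the earlier sketch; the locality element is omitted for brevity and iter_workers is a hypothetical name, not the PR's code):

import heapq
import random
from collections.abc import Iterator

rng = random.Random(0)  # statically seeded: runs are reproducible

def iter_workers(heap: list) -> Iterator[tuple]:
    # Pop the currently-best worker, let the caller drain some of its tasks,
    # then re-push it with a fresh sort key: fetching a task that has
    # replicas removes it from other workers' heaps too, so the old
    # ordering may be stale by the time we come back around.
    while heap:
        _, _, _, worker, tasks = heapq.heappop(heap)
        yield worker, tasks
        if tasks:  # the caller is expected to have drained tasks in between
            heapq.heappush(
                heap,
                (tasks.peek().priority, -len(tasks), rng.random(), worker, tasks),
            )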
Would it be OK if I add unit tests (1) for the determinism and (2) to test that priority is chosen over locality as a follow-up after #6446? That PR makes it easier to write state tests.
sure
Added test about gather priority.
        to_gather={"x5"},
        total_nbytes=4 * 2**20,
    ),
]
Writing this test filled me with pure, unadulterated joy ❤️
- Remove data_needed; rename data_needed_per_worker to data_needed.
- Calling _ensure_communicating a second time in short succession would previously cost O(t*log(t)), where t is the number of tasks exclusively held by workers in flight or busy. The cost is now O(w), where w is the number of workers in flight or busy.