
Support dumping cluster state to URL #5863

Merged 16 commits on Mar 10, 2022

Conversation

gjoseph92 (Collaborator):

Adds support for passing any fsspec.open-compatible URL to Client.dump_cluster_state, in which case the scheduler will generate the cluster dump itself and write it directly to that URL, without transferring anything back to the client.

Previously, the logic to consolidate state/version information from the workers and the scheduler happened via multiple RPCs on the client. Now, this is moved to a Scheduler.get_cluster_state method. You might think this would lead to worse performance, since there's less overlap of communication, but given the way broadcast works (the messages all got de-serialized and re-serialized on the scheduler anyway, without streaming, and the scheduler's event loop is blocked anyway by _to_dict), I don't think this change should make much performance difference. I haven't tested that though.
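
For illustration, a rough sketch of what the scheduler-side consolidation replaces (hypothetical names and dict shapes, not the actual `Scheduler.get_cluster_state` implementation): instead of the client issuing several broadcast RPCs and merging the results, the scheduler assembles one dump dict in a single place.

```python
# Hypothetical sketch of scheduler-side consolidation; the function name,
# arguments, and dict shapes are illustrative, not the real PR code.
def get_cluster_state(scheduler_info: dict, worker_infos: dict) -> dict:
    # Merge scheduler state with per-worker state/version info in one call,
    # so the client no longer needs multiple RPCs to assemble the dump.
    return {
        "scheduler": scheduler_info,
        "workers": worker_infos,
        "versions": {
            "workers": {
                addr: info.get("versions") for addr, info in worker_infos.items()
            }
        },
    }
```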

cc @crusaderky @fjetter @graingert @sevberg

Commits included:
  • Full state dumps (not just scheduler state) are now generated scheduler-side, then sent wholesale back to the client.
  • Scheduler `dump_cluster_state` can write the state directly to a URL via fsspec. Still need to hook this up to a client method.
  • It's not done, just going to move it into a comment on GitHub instead.
distributed/tests/test_scheduler.py (resolved)
distributed/scheduler.py (outdated, resolved)
@github-actions bot commented Feb 24, 2022

Unit Test Results

12 files ±0    12 suites ±0    5h 46m 49s ⏱️ −1h 22m 40s
2 639 tests +15    2 555 ✔️ +17    80 💤 ±0    4 ❌ −2
12 977 runs −2 691    12 333 ✔️ −2 470    637 💤 −222    7 ❌ +1

For more details on these failures, see this check.

Results for commit 879f847. ± Comparison against base commit de94b40.

♻️ This comment has been updated with latest results.

distributed/scheduler.py (outdated, resolved)
distributed/tests/test_client.py (outdated, resolved)
distributed/scheduler.py (outdated, resolved)
Comment on lines 4142 to 4170
if format == "msgpack":
    import msgpack

    # NOTE: `compression="infer"` will automatically use gzip via the `.gz` suffix
    mode = "wb"
    suffix = ".msgpack.gz"
    if not url.endswith(suffix):
        url += suffix
    writer = msgpack.pack
elif format == "yaml":
    import yaml

    mode = "w"
    suffix = ".yaml"
    if not url.endswith(suffix):
        url += suffix

    def writer(state: dict, f):
        # YAML adds unnecessary `!!python/tuple` tags; convert tuples to lists to avoid them.
        def tuple_to_list(node):
            if isinstance(node, (list, tuple)):
                return [tuple_to_list(el) for el in node]
            elif isinstance(node, dict):
                return {k: tuple_to_list(v) for k, v in node.items()}
            else:
                return node

        state = tuple_to_list(state)
        yaml.dump(state, f)
Member:

Do we really want this to be duplicated? There is already drift between the two versions (just a comment so far). Can't this function be refactored so it's reused by both the local and remote paths?

Collaborator (Author):

Yeah, I don't like the duplication either. There are a couple of subtle differences right now:

  • Client runs tuple_to_list for both msgpack and yaml because of serialization; dump_cluster_state_to_url only runs it for yaml. I'd guess doing it for msgpack isn't necessary anywhere, so this shouldn't matter.
  • dump_cluster_state_to_url uses fsspec to open the file; client uses plain open. This gives you a way to get a cluster dump if you don't want to install fsspec.

Neither of those seems like a huge deal to me, so I'll make it a shared function. I'm just not sure where to put it besides the dreaded utils.py.
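
For illustration, a minimal sketch of the URL/format logic such a shared function could factor out of the two call sites (the helper name `dump_target` is hypothetical, not from this PR):

```python
# Hypothetical shared helper factoring out the duplicated format/suffix
# selection from the client and scheduler dump paths; illustrative only.
def dump_target(url: str, format: str = "msgpack"):
    if format == "msgpack":
        # gzip compression is inferred downstream from the .gz suffix
        mode, suffix = "wb", ".msgpack.gz"
    elif format == "yaml":
        mode, suffix = "w", ".yaml"
    else:
        raise ValueError(f"Unsupported format: {format!r}")
    if not url.endswith(suffix):
        url += suffix
    return url, mode
```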

Member:

How about a new module? The stuff in #5873 needs to live somewhere as well

> I'd guess doing it for msgpack isn't necessary anywhere, so this shouldn't matter.

Symmetry. Having everything identical regardless of the output format is nice, since any code you write works on both file types. One thing I actually do often is convert a received msgpack file to yaml, which lets me grep through it 🙈

Collaborator (Author):

> symmetry. Having everything identical regardless of the output format is nice

Converting tuples to lists in the input to msgpack doesn't matter, though, since msgpack itself effectively turns tuples into lists during dumping (tuples and lists are both serialized as msgpack arrays). And for large cluster dumps, I think saving a full traversal and copy of the state is worthwhile.

> One thing I actually do often is to convert a received msgpack file to yaml which allows me to grep stuff

I do this too. When you read the msgpack back in, though, it'll be all lists anyway, so the result is symmetrical when you then dump it to yaml.

Member:

> Converting tuples to lists in the input to msgpack doesn't matter though, since the msgpack itself effectively turns tuples into lists in the dumping process (since tuples are represented as lists)

AFAIK, you'll need to toggle this specifically. IIRC, msgpack load will reconstruct lists (which is horribly slow).
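
The toggle in question appears to be msgpack's `use_list` flag on unpacking: packing a tuple produces a msgpack array, and unpacking returns lists by default (tuples only with `use_list=False`). A quick check, assuming the `msgpack` package is installed:

```python
import msgpack

# A tuple is serialized as a plain msgpack array; the tuple-ness is not preserved.
packed = msgpack.packb((1, 2))

# By default (use_list=True), arrays come back as Python lists.
assert msgpack.unpackb(packed) == [1, 2]

# Only with use_list=False does unpacking reconstruct tuples instead.
assert msgpack.unpackb(packed, use_list=False) == (1, 2)
```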

@fjetter (Member) left a comment:

There appears to be a related test failure

FAILED distributed/tests/test_cluster_dump.py::test_url_and_writer_yaml - Typ...

@gjoseph92 gjoseph92 self-assigned this Mar 8, 2022
@ian-r-rose ian-r-rose self-requested a review March 8, 2022 17:05
distributed/scheduler.py Outdated Show resolved Hide resolved
distributed/client.py Outdated Show resolved Hide resolved
distributed/client.py Outdated Show resolved Hide resolved
@sjperkins sjperkins mentioned this pull request Mar 9, 2022
3 tasks
@gjoseph92 gjoseph92 requested a review from fjetter March 9, 2022 18:13
@gjoseph92 (Collaborator, Author):

Possibly some #5869 failures in the tests, but at least the tests from this PR aren't failing?

FAILED distributed/tests/test_cancelled_state.py::test_worker_stream_died_during_comm
FAILED distributed/tests/test_scheduler.py::test_missing_data_errant_worker
Error: Process completed with exit code 1

@fjetter (Member) commented Mar 10, 2022:

From what I can tell, all test failures are unrelated. There are also a bunch of #5910 failures.
Thank you @gjoseph92 !

Successfully merging this pull request may close these issues.

Make writing cluster state dumps to buckets easier