Ensure reconnecting workers do not loose required data #5436

fjetter · 2021-10-18T12:56:07Z

Again, a deadlock masked by mark.repeat flags. We should be more careful with these flags. This time, at least, it showed up as AssertionErrors in tests.

If a worker briefly disconnects, it gets the chance to register which tasks it already has in memory and which it is currently executing. If it holds data that has already been released, it is also asked to release that data. However, this introduces a subtle race condition: the worker is asked to release all of its data even though we simultaneously expect it to compute a dependent of those keys. Therefore, we should use the safe handler (remove-replicas) instead of the strict handler (free-keys). See the comment below for the relevant section.
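To illustrate the difference between the two handlers, here is a minimal toy sketch. `ToyWorker`, `waiting_on`, and both method bodies are hypothetical stand-ins written for this example; they are not the real `distributed.Worker` implementation, and they only assume the behaviour described above (strict means "drop unconditionally", safe means "drop only what no local task still needs").

```python
from dataclasses import dataclass, field


@dataclass
class ToyWorker:
    """Hypothetical stand-in for a worker; NOT the real distributed.Worker."""

    # key -> value currently held in memory
    data: dict = field(default_factory=dict)
    # dependent key -> set of dependency keys it still needs locally
    waiting_on: dict = field(default_factory=dict)

    def needed_keys(self):
        """Return every key that some pending local task depends on."""
        needed = set()
        for deps in self.waiting_on.values():
            needed |= deps
        return needed

    def free_keys(self, keys):
        """Strict handler (assumed semantics): drop every listed key."""
        for key in keys:
            self.data.pop(key, None)

    def remove_replicas(self, keys):
        """Safe handler (assumed semantics): keep keys that are still needed.

        Returns the keys that were NOT removed.
        """
        needed = self.needed_keys()
        rejected = []
        for key in keys:
            if key in needed:
                rejected.append(key)
            else:
                self.data.pop(key, None)
        return rejected


def make_worker():
    # The worker holds "x" and "y" and is expected to compute "z" from "x".
    return ToyWorker(data={"x": 1, "y": 2}, waiting_on={"z": {"x"}})


strict = make_worker()
strict.free_keys(["x", "y"])  # "x" is gone, so "z" can no longer be computed

safe = make_worker()
rejected = safe.remove_replicas(["x", "y"])  # "x" survives for "z"
```

In this toy model the strict call leaves the worker without the input for `z`, which is the kind of state the race condition above could produce, while the safe call drops only `y`.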

I took this chance to clean up the signature around free-keys so that it is in line with the rest.

Closes test_worker_reconnects_mid_compute_multiple_states_on_scheduler flaky #5377

fjetter · 2021-10-18T12:56:31Z

distributed/scheduler.py

@@ -4339,9 +4343,9 @@ async def add_worker(
                        worker_msgs[address] = []
                    worker_msgs[address].append(
                        {
-                            "op": "free-keys",
+                            "op": "remove-replicas",


This is the relevant change

Ensure reconnecting workers do not loose required data

2545768


fjetter requested a review from crusaderky October 18, 2021 12:56

fjetter added 2 commits October 18, 2021 17:20

fix free_keys calls

19d4bb8

attach timestamp to stim ID of delete_worker_data

f9bb38f

crusaderky approved these changes Oct 18, 2021


fjetter merged commit 7d2516a into dask:main Oct 18, 2021

fjetter deleted the fix_flaky_reconnect_worker_tests branch October 18, 2021 17:16

jrbourbeau mentioned this pull request Oct 18, 2021

test_worker_reconnects_mid_compute_multiple_states_on_scheduler flaky #5377

Closed

zanieb pushed a commit to zanieb/distributed that referenced this pull request Oct 28, 2021

Ensure reconnecting workers do not loose required data (dask#5436)

e8b47eb

fjetter mentioned this pull request Jan 21, 2022

Conditions under which a TCP connection may fail / close? #5678

Closed
