Reset state of `ShuffleSchedulerExtension` on restart #7446

hendrikmakait · 2023-01-03T19:26:42Z

Fixes an issue where forgotten shuffles could not be re-run after a cluster restart.

Tests added / passed
Passes `pre-commit run --all-files`

github-actions · 2023-01-03T20:28:41Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      22 files ±  0       22 suites ±0 10h 20m 46s ⏱️ + 50m 45s
  3 288 tests +  1   3 202 ✔️ +1     85 💤 ±0 1 ❌ ±0
36 099 runs +11 34 536 ✔️ +7 1 562 💤 +4 1 ❌ ±0

For more details on these failures, see this check.

Results for commit cd8ddff. ± Comparison against base commit b5a2078.

♻️ This comment has been updated with latest results.

fjetter · 2023-01-04T13:04:21Z

distributed/shuffle/tests/test_shuffle.py

+    # Cannot rerun forgotten shuffle due to tombstone
+    with pytest.raises(RuntimeError, match="shuffle_transfer"):
+        await c.compute(dd.shuffle.shuffle(df, "y", shuffle="p2p"))


Hold on, this is concerning. We cannot rerun the same shuffle even after it finished successfully?

hendrikmakait · 2023-01-04

Yes. Because `shuffle_get` works in a get-or-create style, cancelled tasks could otherwise re-create a forgotten shuffle, which would leave stale state on the cluster indefinitely. We should be able to solve this with something like #7372.
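To make the trade-off above concrete, here is a minimal sketch of how a tombstone interacts with a get-or-create lookup and why a restart has to clear both. Everything in it is an assumption for illustration: `TombstonedRegistry`, its attributes, and its method names are hypothetical and are not the actual `ShuffleSchedulerExtension` API.

```python
class TombstonedRegistry:
    """Hypothetical stand-in for scheduler-side shuffle bookkeeping."""

    def __init__(self):
        self.states = {}         # shuffle id -> live state
        self.tombstones = set()  # ids of shuffles that were forgotten

    def get_or_create(self, shuffle_id):
        # Without this check, a late (e.g. cancelled) task could silently
        # re-create state for a shuffle that was already cleaned up.
        if shuffle_id in self.tombstones:
            raise RuntimeError(f"shuffle {shuffle_id!r} was forgotten")
        return self.states.setdefault(shuffle_id, {"id": shuffle_id})

    def forget(self, shuffle_id):
        # Drop the live state and leave a tombstone behind.
        self.states.pop(shuffle_id, None)
        self.tombstones.add(shuffle_id)

    def restart(self):
        # After a restart no old task can still be running, so both the
        # live state and the tombstones can safely be dropped.
        self.states.clear()
        self.tombstones.clear()


registry = TombstonedRegistry()
registry.get_or_create("abc")
registry.forget("abc")
try:
    registry.get_or_create("abc")
    rerun_blocked = False
except RuntimeError:
    rerun_blocked = True

registry.restart()
state_after_restart = registry.get_or_create("abc")
```

In this sketch the tombstone blocks a rerun of the same shuffle id until `restart()` is called, which mirrors the behaviour discussed in this thread: reruns after a restart work, while reruns without a restart still fail.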

See also

distributed/distributed/shuffle/tests/test_shuffle.py

Lines 892 to 918 in 401b51d

@pytest.mark.xfail(reason="Tombstone prohibits multiple calls to head")
@gen_cluster(client=True, nthreads=[("127.0.0.1", 4)] * 2)
async def test_repeat(c, s, a, b):
    df = dask.datasets.timeseries(
        start="2000-01-01",
        end="2000-01-10",
        dtypes={"x": float, "y": float},
        freq="100 s",
    )
    out = dd.shuffle.shuffle(df, "x", shuffle="p2p")
    await c.compute(out.head(compute=False))

    await clean_worker(a, timeout=2)
    await clean_worker(b, timeout=2)
    await clean_scheduler(s, timeout=2)

    await c.compute(out.tail(compute=False))

    await clean_worker(a, timeout=2)
    await clean_worker(b, timeout=2)
    await clean_scheduler(s, timeout=2)

    await c.compute(out.head(compute=False))

    await clean_worker(a, timeout=2)
    await clean_worker(b, timeout=2)
    await clean_scheduler(s, timeout=2)

Reset state on restart

fbb696b

hendrikmakait self-assigned this Jan 4, 2023

Add test

e094d5d

hendrikmakait marked this pull request as ready for review January 4, 2023 12:28

hendrikmakait marked this pull request as draft January 4, 2023 12:30

Improve test

cd8ddff

hendrikmakait marked this pull request as ready for review January 4, 2023 12:43

fjetter reviewed Jan 4, 2023

fjetter approved these changes Jan 5, 2023

fjetter merged commit 2768bbd into dask:main Jan 5, 2023
