Reset state of ShuffleSchedulerExtension on restart #7446

Merged

@hendrikmakait hendrikmakait commented Jan 3, 2023

Fixes an issue where forgotten shuffles could not be re-run after a cluster restart.

  • Tests added / passed
  • Passes pre-commit run --all-files

github-actions bot commented Jan 3, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

Files:    22 (±0)    Suites: 22 (±0)    Duration: 10h 20m 46s (+50m 45s)
Tests:    3 288 (+1): 3 202 passed (+1), 85 skipped (±0), 1 failed (±0)
Runs:     36 099 (+11): 34 536 passed (+7), 1 562 skipped (+4), 1 failed (±0)

For more details on these failures, see this check.

Results for commit cd8ddff. ± Comparison against base commit b5a2078.

♻️ This comment has been updated with latest results.

@hendrikmakait hendrikmakait self-assigned this Jan 4, 2023
@hendrikmakait hendrikmakait marked this pull request as ready for review January 4, 2023 12:28
@hendrikmakait hendrikmakait marked this pull request as draft January 4, 2023 12:30
@hendrikmakait hendrikmakait marked this pull request as ready for review January 4, 2023 12:43
Comment on lines +1030 to +1032
# Cannot rerun forgotten shuffle due to tombstone
with pytest.raises(RuntimeError, match="shuffle_transfer"):
    await c.compute(dd.shuffle.shuffle(df, "y", shuffle="p2p"))
Member

Hold on, this is concerning. We cannot rerun the same shuffle even after it finished successfully?

Member Author

Yes. Because shuffle_get is get-or-create, a cancelled task could otherwise re-create a forgotten shuffle, which would leave stale state on the cluster indefinitely. We should be able to solve this with something like #7372.
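
To make the trade-off concrete, here is a minimal sketch of a get-or-create registry guarded by tombstones. It is not the code in this PR: `ToyShuffleRegistry` and all of its methods are hypothetical names chosen for illustration, and the assumption that a restart clears both live state and tombstones is inferred from the PR title rather than taken from the diff.

```python
class ToyShuffleRegistry:
    """Hypothetical stand-in for scheduler-side shuffle bookkeeping."""

    def __init__(self):
        self.states = {}         # shuffle id -> live state
        self.tombstones = set()  # ids of shuffles that were forgotten

    def get_or_create(self, shuffle_id):
        # The tombstone stops a late caller (for example a cancelled task
        # that is still running) from silently re-creating forgotten state.
        if shuffle_id in self.tombstones:
            raise RuntimeError(f"shuffle {shuffle_id!r} was forgotten")
        return self.states.setdefault(shuffle_id, {"id": shuffle_id})

    def forget(self, shuffle_id):
        # Drop the live state and remember that the id must not come back.
        self.states.pop(shuffle_id, None)
        self.tombstones.add(shuffle_id)

    def restart(self):
        # Assumed behaviour after a cluster restart: no task from before the
        # restart survives, so tombstones can be cleared along with state.
        self.states.clear()
        self.tombstones.clear()


def demo():
    registry = ToyShuffleRegistry()
    registry.get_or_create("s1")
    registry.forget("s1")
    try:
        registry.get_or_create("s1")
        blocked = False
    except RuntimeError:
        blocked = True
    registry.restart()
    recreated = registry.get_or_create("s1")
    return blocked, recreated
```

In this sketch the tombstone is also what blocks a legitimate re-run of the same shuffle, which is the limitation raised in the review comment; only `restart()` lifts it.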

Member Author

@hendrikmakait hendrikmakait Jan 4, 2023

See also

@pytest.mark.xfail(reason="Tombstone prohibits multiple calls to head")
@gen_cluster(client=True, nthreads=[("127.0.0.1", 4)] * 2)
async def test_repeat(c, s, a, b):
    df = dask.datasets.timeseries(
        start="2000-01-01",
        end="2000-01-10",
        dtypes={"x": float, "y": float},
        freq="100 s",
    )
    out = dd.shuffle.shuffle(df, "x", shuffle="p2p")
    await c.compute(out.head(compute=False))
    await clean_worker(a, timeout=2)
    await clean_worker(b, timeout=2)
    await clean_scheduler(s, timeout=2)
    await c.compute(out.tail(compute=False))
    await clean_worker(a, timeout=2)
    await clean_worker(b, timeout=2)
    await clean_scheduler(s, timeout=2)
    await c.compute(out.head(compute=False))
    await clean_worker(a, timeout=2)
    await clean_worker(b, timeout=2)
    await clean_scheduler(s, timeout=2)

Member

xref #7452

@fjetter fjetter merged commit 2768bbd into dask:main Jan 5, 2023