test_release_evloop_while_spilling failing on MacOSX #6233

Closed
fjetter opened this issue Apr 28, 2022 · 4 comments · Fixed by #6291
Labels: flaky test (Intermittent failures on CI.)

@fjetter
Member

fjetter commented Apr 28, 2022

This test was introduced as part of #6189 and is failing on main with

https://github.com/dask/distributed/runs/6205707003?check_suite_focus=true
and
https://github.com/dask/distributed/runs/6205707211?check_suite_focus=true

E               asyncio.exceptions.TimeoutError: Test timeout after 30s.
E               ========== Test stack trace starts here ==========
E               Stack for <Task pending name='Task-173075' coro=<test_release_evloop_while_spilling() running at /Users/runner/work/distributed/distributed/distributed/tests/test_worker_memory.py:799>> (most recent call last):
E                 File "/Users/runner/work/distributed/distributed/distributed/tests/test_worker_memory.py", line 799, in test_release_evloop_while_spilling
E                   await asyncio.sleep(0)

distributed/utils_test.py:1056: TimeoutError
----------------------------- Captured stdout call -----------------------------
Dumped cluster state to test_cluster_dump/test_release_evloop_while_spilling.yaml
----------------------------- Captured stderr call -----------------------------
2022-04-28 06:44:51,918 - distributed.spill - ERROR - Spill to disk failed; keeping data in memory
Traceback (most recent call last):
  File "/Users/runner/work/distributed/distributed/distributed/spill.py", line 115, in handle_errors
    yield
  File "/Users/runner/work/distributed/distributed/distributed/spill.py", line 211, in evict
    _, _, weight = self.fast.evict()
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.8/site-packages/zict/lru.py", line 100, in evict
    cb(k, v)
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.8/site-packages/zict/buffer.py", line 62, in fast_to_slow
    self.slow[key] = value
  File "/Users/runner/work/distributed/distributed/distributed/spill.py", line 312, in __setitem__
    self.d[key] = pickled
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.8/site-packages/zict/file.py", line 86, in __setitem__
    with open(fn, "wb") as fh:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/runner/work/distributed/distributed/dask-worker-space/worker-01059kdd/storage/SlowSpill-60cdae9f-d152-4278-a7db-abb4e617229c'

cc @crusaderky

fjetter added the "flaky test" label on Apr 28, 2022
@crusaderky
Collaborator

crusaderky commented Apr 28, 2022

I... have no idea what to do about it? The test relies on having a functioning hard drive to spill to.
What's unique about this test is that it's hitting the hard disk with 100x open->1 write->close in a very short burst. Which is... a reasonable thing to ask for?
It looks like the OS either randomly nuked the spill directory, or is improperly raising ENOENT on an unrelated I/O error (which would not surprise me on OSX).

I initially thought of collisions between this test and a late cleanup of another test, but (1) that would be visible in other spill-related tests and (2) the worker-01059kdd directory is created by tempfile.mkdtemp (distributed.diskutils:48), so I would assume it's reasonably robust.

The only thing I can think of is to make it fail faster and xfail it on MacOSX?
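
For anyone who wants to poke at the I/O pattern outside of the test suite, here is a minimal standalone sketch of the burst described above: 100 cycles of open, one write, close, inside a directory created by tempfile.mkdtemp. The helper name `burst_write`, the payload size, and the file naming are assumptions made up for illustration; the sketch does not go through distributed.spill or zict, so it may well not reproduce the failure.

```python
import os
import shutil
import tempfile
import time


def burst_write(n_files=100, payload_size=100_000):
    """Write n_files files back to back: open -> one write -> close.

    Hypothetical helper, not part of distributed.
    Returns (elapsed seconds, list of OSError instances that were raised).
    """
    payload = b"x" * payload_size  # assumed size; the real test may differ
    tmpdir = tempfile.mkdtemp(prefix="spill-burst-")
    errors = []
    start = time.perf_counter()
    try:
        for i in range(n_files):
            fn = os.path.join(tmpdir, f"key-{i}")
            try:
                with open(fn, "wb") as fh:
                    fh.write(payload)
            except OSError as e:
                # This is where an ENOENT like the one in the CI log would land
                errors.append(e)
        elapsed = time.perf_counter() - start
    finally:
        # Always clean up the temporary directory
        shutil.rmtree(tmpdir, ignore_errors=True)
    return elapsed, errors


if __name__ == "__main__":
    elapsed, errors = burst_write()
    print(f"{elapsed:.2f}s, {len(errors)} error(s)")
```

Running it in a loop on an affected runner would at least show whether the plain open/write/close burst is slow or error-prone on its own.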

crusaderky changed the title from "test_release_evloop_while_spilling failing" to "test_release_evloop_while_spilling failing on MacOSX" on Apr 28, 2022
@fjetter
Member Author

fjetter commented Apr 29, 2022

Could the spill buffer for this test simply use an in-memory file/buffer instead of disk?
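
A rough sketch of what such an in-memory stand-in could look like, assuming the disk-backed store only needs to behave like a `MutableMapping` (the traceback shows it being written to with `self.slow[key] = value`). The class name `InMemorySlow` is hypothetical, and how it would be plugged into the spill buffer is not shown here and would need checking against distributed/spill.py.

```python
import pickle
from collections.abc import MutableMapping


class InMemorySlow(MutableMapping):
    """Dict-backed stand-in for a disk-backed mapping.

    Hypothetical name, not part of distributed or zict. Values are pickled
    on write and unpickled on read, but the bytes never touch the disk.
    """

    def __init__(self):
        self._data = {}  # key -> pickled bytes

    def __setitem__(self, key, value):
        self._data[key] = pickle.dumps(value)

    def __getitem__(self, key):
        return pickle.loads(self._data[key])

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)
```

Usage is the same as a dict: `slow = InMemorySlow(); slow["x"] = [1, 2]; slow["x"]` gives back an equal list. Note that writes to such a store are much faster than real disk writes, so the test may need another way to simulate slow spilling.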

@crusaderky
Collaborator

crusaderky commented Apr 29, 2022

This is ridiculous:

Ubuntu:

3.12s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[2]
3.06s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[5]
3.06s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[3]
3.04s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[4]
3.03s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[1]
3.02s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[7]
3.00s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[0]
2.99s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[6]
2.99s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[8]
2.99s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[9]

Windows:

3.04s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[5]
3.04s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[1]
3.03s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[8]
3.03s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[7]
3.03s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[0]
3.03s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[3]
3.03s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[4]
3.03s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[2]
3.02s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[6]
3.01s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[9]

MacOSX:

37.47s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[5]
36.59s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[0]
12.91s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[4]
12.77s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[9]
12.64s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[6]
12.61s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[2]
12.54s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[8]
12.30s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[1]
12.12s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[7]
11.75s call     distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[3]
FAILED distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[0]
FAILED distributed/tests/test_worker_memory.py::test_release_evloop_while_spilling[5]
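
To put a number on the gap, a quick sketch that summarizes the durations pasted above: the median per platform and how many runs exceeded the 30s test timeout from the CI log. The function name `summarize` is made up for this comment; the numbers are copied verbatim from the listings.

```python
from statistics import median

# Durations in seconds, copied from the listings above
durations = {
    "Ubuntu": [3.12, 3.06, 3.06, 3.04, 3.03, 3.02, 3.00, 2.99, 2.99, 2.99],
    "Windows": [3.04, 3.04, 3.03, 3.03, 3.03, 3.03, 3.03, 3.03, 3.02, 3.01],
    "MacOSX": [37.47, 36.59, 12.91, 12.77, 12.64, 12.61, 12.54, 12.30, 12.12, 11.75],
}

TIMEOUT = 30  # seconds, from "Test timeout after 30s" in the CI log


def summarize(durations, timeout=TIMEOUT):
    """Return {platform: (median seconds, number of runs over the timeout)}."""
    return {
        name: (median(values), sum(v > timeout for v in values))
        for name, values in durations.items()
    }


if __name__ == "__main__":
    for name, (med, over) in summarize(durations).items():
        print(f"{name}: median {med:.2f}s, {over} run(s) over {TIMEOUT}s")
```

The two MacOSX runs over the timeout are the two parametrizations reported as FAILED.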

@fjetter
Member Author

fjetter commented Apr 29, 2022

I would suggest just skipping it on OSX and linking to this issue.
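
Something along these lines, as a sketch only: the `MACOS` constant is defined locally here for self-containment, the real test's decorators and body are elided, and the exact wording of the reason is an assumption.

```python
import sys

import pytest

# Defined locally for this sketch; True only when running on macOS
MACOS = sys.platform == "darwin"


@pytest.mark.skipif(
    MACOS,
    reason="flaky on MacOSX; see https://github.com/dask/distributed/issues/6233",
)
def test_release_evloop_while_spilling():
    ...  # existing test body and decorators unchanged, elided here
```

With the reason pointing back here, the skip shows up in the pytest summary with a link to the context.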
