
Ananse network crashes! #140

Closed
JGASmits opened this issue Oct 5, 2021 · 7 comments

Comments

@JGASmits
Contributor

JGASmits commented Oct 5, 2021

Ananse network crashes when running on cn106! Not sure what changed (something server-related?).

Running ananse network from the conda version of ananse (0.3) results in the following error on cn06:


2021-10-05 14:49:16 | INFO | Loading expression
2021-10-05 14:50:26 | INFO | Binding file contains 814 TFs.
2021-10-05 14:50:28 | INFO | Aggregating binding for genes on chr1
(aggregation timebars for all chroms)
2021-10-05 15:00:44 | INFO | Reading factor activity
2021-10-05 15:00:44 | INFO | Computing network
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.worker - ERROR - Worker stream died during communication: tcp://131.174.136.191:34449
Traceback (most recent call last):
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 208, in read
    n = await stream.read_into(chunk)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 2387, in gather_dep
    response = await get_data_from_worker(
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3759, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3739, in _get_data
    response = await send_recv(
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/core.py", line 651, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 214, in read
    convert_stream_closed_error(self, e)
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://131.174.136.191:45164 remote=tcp://131.174.136.191:34449>: Stream is closed
distributed.worker - ERROR - Worker stream died during communication: tcp://131.174.136.191:34449
Traceback (most recent call last):
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 198, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 2387, in gather_dep
    response = await get_data_from_worker(
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3759, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3739, in _get_data
    response = await send_recv(
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/core.py", line 651, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 214, in read
    convert_stream_closed_error(self, e)
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://131.174.136.191:45168 remote=tcp://131.174.136.191:34449>: Stream is closed
distributed.worker - ERROR - Worker stream died during communication: tcp://131.174.136.191:34449
Traceback (most recent call last):
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 2387, in gather_dep
    response = await get_data_from_worker(
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3759, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3739, in _get_data
    response = await send_recv(
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/core.py", line 651, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 214, in read
    convert_stream_closed_error(self, e)
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://131.174.136.191:45150 remote=tcp://131.174.136.191:34449>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.nanny - WARNING - Restarting worker
distributed.worker - WARNING - Worker is at 90% memory usage. Pausing worker.  Process memory: 10.08 GiB -- Worker memory limit: 11.18 GiB
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.worker - WARNING - Worker is at 91% memory usage. Pausing worker.  Process memory: 10.20 GiB -- Worker memory limit: 11.18 GiB
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.worker - WARNING - Worker is at 87% memory usage. Pausing worker.  Process memory: 9.75 GiB -- Worker memory limit: 11.18 GiB
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
Traceback (most recent call last):
  File "/vol/mbconda/jsmits/envs/ananse/bin/ananse", line 369, in <module>
    args.func(args)
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/ananse/commands/network.py", line 46, in network
    b.run_network(
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/ananse/network.py", line 673, in run_network
    result = result.compute()
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/dask/base.py", line 288, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/dask/base.py", line 570, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/client.py", line 2689, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/client.py", line 1966, in gather
    return self.sync(
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/client.py", line 860, in sync
    return sync(
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/utils.py", line 326, in sync
    raise exc.with_traceback(tb)
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/utils.py", line 309, in f
    result[0] = yield future
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/vol/mbconda/jsmits/envs/ananse/lib/python3.9/site-packages/distributed/client.py", line 1831, in _gather
    raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: ("('getitem-f717d1ecd9f8241fd8481d17450fb255', 0)", <WorkerState 'tcp://131.174.
@simonvh
Member

simonvh commented Oct 5, 2021

This was a very tricky issue to solve before.
Can you try it with fewer cores/threads?

@JGASmits
Contributor Author

JGASmits commented Oct 5, 2021

I was running it with 4 cores, I will try to see if running it with a single core fixes it.

@JGASmits
Contributor Author

JGASmits commented Oct 5, 2021

Running it with 1 core seems to work!

@JGASmits JGASmits closed this as completed Oct 5, 2021
@Maarten-vd-Sande
Member

Does it make sense to catch these errors and print a message suggesting fewer cores? Or just limit the maximum number of cores that can be used?

@simonvh
Member

simonvh commented Oct 5, 2021

I'm not sure the errors are consistent, but catching them would be a good idea. The maximum number of cores really depends on the available resources and the specifics of the data set.
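A minimal sketch of what catching could look like, wrapping the result.compute() call that appears in the traceback above (the helper name is made up here; ANANSE's internals may differ):

```python
from distributed.scheduler import KilledWorker


def compute_with_hint(dask_obj):
    """Compute a dask object, translating a KilledWorker crash into a readable hint."""
    try:
        return dask_obj.compute()
    except KilledWorker as err:
        raise RuntimeError(
            "A dask worker was killed, most likely because it ran out of memory. "
            "Try rerunning ananse network with fewer cores, or on a node with more memory."
        ) from err
```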

@Maarten-vd-Sande
Member

Another thing someone (not me 🙃) could try is converting each gene name into an integer or something similar. If you have long gene names this might save quite some memory; see the sketch below.
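For illustration, pandas can already do this mapping with factorize (the gene names below are just placeholders):

```python
import pandas as pd

# Sketch of the integer-mapping idea; the gene names are placeholders.
genes = pd.Series(["SOX2", "GATA3", "SOX2", "POU5F1", "GATA3"])
codes, labels = pd.factorize(genes)
print(codes)         # [0 1 0 2 1]  -> small integers instead of repeated strings
print(list(labels))  # ['SOX2', 'GATA3', 'POU5F1'] -> lookup table to map back for output
```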

@simonvh
Member

simonvh commented Oct 6, 2021

That could easily be done by using the categorical data type. This may actually save quite a lot of memory, due to the TF x target combinations.
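For example (the column names and sizes below are assumptions, not ANANSE's actual schema), converting the TF and target columns of an edge table to the category dtype shows the kind of savings involved:

```python
import itertools

import pandas as pd

# Assumed toy schema: 100 TFs x 10,000 target genes = 1,000,000 edges.
tfs = [f"TF{i}" for i in range(100)]
targets = [f"gene{i}" for i in range(10_000)]
edges = pd.DataFrame(list(itertools.product(tfs, targets)), columns=["tf", "target"])

before = edges.memory_usage(deep=True).sum()
edges["tf"] = edges["tf"].astype("category")
edges["target"] = edges["target"].astype("category")
after = edges.memory_usage(deep=True).sum()
print(f"object dtype: {before / 1e6:.0f} MB, categorical: {after / 1e6:.0f} MB")
```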
