
too many file descriptors in select() on windows #686

Closed
jreback opened this issue Nov 17, 2016 · 9 comments

Comments

jreback (Contributor) commented Nov 17, 2016

On Windows (using 750 workers and 12000 tasks, though the scheduler is on Linux), with a stock Python build.

dask - 0.12.0
distributed - 1.14.3

Some processes can error out with:
too many file descriptors in select()

jreback (Contributor, Author) commented Nov 17, 2016

cc @pitrou @mrocklin

pitrou (Member) commented Nov 17, 2016

Are those processes client processes or worker processes? If a client is gathering data from many different workers, it may end up having too many active connections at once.

jreback (Contributor, Author) commented Nov 17, 2016

These are worker processes (1 thread each). I decided to partition my data more (because I was having some other issues).

mrocklin (Member) commented:

@pitrou likely workers. The client normally only communicates with the scheduler. All data to the client is routed through the scheduler because it is common for workers to not be publicly visible.

@jreback what kind of computation are you doing?

My guess is that a few of the workers are serving data to, or requesting data from, many other workers at once. On the requesting side we could try to ensure that we only collect from a few workers at a time; there are some possible performance benefits to this as well for large shuffles.
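
For illustration only (this is not distributed's actual code): the requesting side could bound its concurrency with a semaphore, roughly like the sketch below. The `fetch_from_worker` coroutine and the limit of 10 are hypothetical placeholders.

```python
# Illustrative sketch only -- not distributed's implementation.  The idea: bound
# how many peer connections a gathering worker opens at once with a semaphore.
import asyncio

MAX_CONCURRENT_PEERS = 10  # assumed limit; would need tuning per platform

async def gather_from_peers(fetch_from_worker, addresses):
    sem = asyncio.Semaphore(MAX_CONCURRENT_PEERS)

    async def bounded(addr):
        async with sem:  # at most MAX_CONCURRENT_PEERS sockets open at once
            return await fetch_from_worker(addr)

    # Results come back in the same order as `addresses`
    return await asyncio.gather(*(bounded(a) for a in addresses))
```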

On the serving side I don't know how to get Tornado to refuse connections, or how to ensure that they are cleaned up promptly.

@jreback are you able to provide a full traceback? It would be interesting to see where this problem arose.

jreback (Contributor, Author) commented Nov 17, 2016

Traceback (most recent call last):
  File "C:\Users\distributedagent\AppData\Local\Temp\Domains\Unknown_537814893\shared\grid\gridtask.py", line 60, in <module>
    result = function(funcArgs)
  File "C:\dev\p4\bmc\src\python2\shared\grid\dasklauncher.py", line 87, in start_worker
  File "C:\Anaconda\envs\bmc3_647D667BAA5F524104DFFF7AC41BFB6311B53FE1\lib\site-packages\tornado\ioloop.py", line 452, in run_sync
    self.start()
  File "C:\Anaconda\envs\bmc3_647D667BAA5F524104DFFF7AC41BFB6311B53FE1\lib\site-packages\tornado\ioloop.py", line 862, in start
    event_pairs = self._impl.poll(poll_timeout)
  File "C:\Anaconda\envs\bmc3_647D667BAA5F524104DFFF7AC41BFB6311B53FE1\lib\site-packages\tornado\platform\select.py", line 63, in poll
    self.read_fds, self.write_fds, self.error_fds, timeout)
ValueError: too many file descriptors in select()

pitrou (Member) commented Nov 17, 2016

I've filed ContinuumIO/anaconda-issues#1241

jreback (Contributor, Author) commented Nov 17, 2016

This is a simple load from a remote source (kind of like S3, but not exactly), which has worked flawlessly for quite a while. The difference now is that I am using 2x the partitions (was 8000, now 16000), and a few more workers.

jreback (Contributor, Author) commented Nov 17, 2016

So the exact same computation worked perfectly when I had 4000 tasks (and 500 cores) instead.

mrocklin (Member) commented:

We have significantly reduced the number of open sockets between workers and the scheduler, but there are still at least one or two per worker process, and the scheduler must be able to open at least that many files. On Windows this is tricky, because the select() limit is hard-coded into the Python build. We have increased this hard-coded limit in the conda-forge and conda defaults recipes to something like two thousand. For the moment I think that this is all we can do without significant changes. Closing.
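
As a rough illustration of that hard-coded ceiling (an assumption-laden sketch, not something from this thread): CPython's select() on Windows only accepts up to FD_SETSIZE sockets per call, and the probe below estimates the effective limit of whatever build it runs on by opening plain sockets until select() refuses them.

```python
# Rough probe of the select() socket limit in the current Python build.
# Illustrative only: it opens unconnected sockets and passes them all to
# select(); once the build's limit is exceeded, the same ValueError seen
# in the traceback above is raised (or an OSError on other platforms).
import select
import socket

socks = []
try:
    while len(socks) < 5000:  # safety cap so the probe always terminates
        socks.append(socket.socket())
        select.select(socks, [], [], 0)
except (ValueError, OSError) as exc:
    print("select() refused %d sockets: %s" % (len(socks), exc))
finally:
    for s in socks:
        s.close()
```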
