
too many file descriptors in select() on windows #686

Closed
jreback opened this issue Nov 17, 2016 · 9 comments

Comments

jreback (Contributor) commented Nov 17, 2016

On Windows (using 750 workers and 12000 tasks, though the scheduler is on Linux), with a stock Python build.

dask - 0.12.0
distributed - 1.14.3

Some processes can error out with:
too many file descriptors in select()

jreback (Contributor, Author) commented Nov 17, 2016

cc @pitrou @mrocklin

pitrou (Member) commented Nov 17, 2016

Are those processes client processes or worker processes? If a client is gathering data from many different workers, it may end up having too many active connections at once.

jreback (Contributor, Author) commented Nov 17, 2016

These are worker processes (1 thread each). I decided to partition my data more (because I was having some other issues).

mrocklin (Member) commented:

@pitrou likely workers. The client normally only communicates with the scheduler. All data to the client is routed through the scheduler because it is common for workers to not be publicly visible.

@jreback what kind of computation are you doing?

My guess is that a few of the workers are serving data to, or requesting data from, many other workers at once. On the requesting side we could try to ensure that we only collect from a few workers at a time; there are some possible performance benefits to this as well for large shuffles.
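
For illustration only (this is not distributed's actual code): the requesting side could bound its concurrency with a semaphore, roughly like the sketch below. The `fetch_from_worker` coroutine and the limit of 10 are hypothetical placeholders.

```python
# Illustrative sketch only -- not distributed's implementation.  The idea: bound
# how many peer connections a gathering worker opens at once with a semaphore.
import asyncio

MAX_CONCURRENT_PEERS = 10  # assumed limit; would need tuning per platform

async def gather_from_peers(fetch_from_worker, addresses):
    sem = asyncio.Semaphore(MAX_CONCURRENT_PEERS)

    async def bounded(addr):
        async with sem:  # at most MAX_CONCURRENT_PEERS sockets open at once
            return await fetch_from_worker(addr)

    # Results come back in the same order as `addresses`
    return await asyncio.gather(*(bounded(a) for a in addresses))
```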

On the serving side I don't know how to get Tornado to refuse connections, or how to ensure that they are cleaned up promptly.

@jreback are you able to provide a full traceback? It would be interesting to see where this problem arose.

jreback (Contributor, Author) commented Nov 17, 2016

Traceback (most recent call last):
  File "C:\Users\distributedagent\AppData\Local\Temp\Domains\Unknown_537814893\shared\grid\gridtask.py", line 60, in <module>
    result = function(funcArgs)
  File "C:\dev\p4\bmc\src\python2\shared\grid\dasklauncher.py", line 87, in start_worker
  File "C:\Anaconda\envs\bmc3_647D667BAA5F524104DFFF7AC41BFB6311B53FE1\lib\site-packages\tornado\ioloop.py", line 452, in run_sync
    self.start()
  File "C:\Anaconda\envs\bmc3_647D667BAA5F524104DFFF7AC41BFB6311B53FE1\lib\site-packages\tornado\ioloop.py", line 862, in start
    event_pairs = self._impl.poll(poll_timeout)
  File "C:\Anaconda\envs\bmc3_647D667BAA5F524104DFFF7AC41BFB6311B53FE1\lib\site-packages\tornado\platform\select.py", line 63, in poll
    self.read_fds, self.write_fds, self.error_fds, timeout)
ValueError: too many file descriptors in select()

pitrou (Member) commented Nov 17, 2016

I've filed ContinuumIO/anaconda-issues#1241

jreback (Contributor, Author) commented Nov 17, 2016

This is a simple load from a remote source (kind of like S3, but not exactly), which has worked flawlessly for quite a while. The difference now is that I am using 2x the partitions (was 8000, now 16000), and a few more workers.

jreback (Contributor, Author) commented Nov 17, 2016

So the exact same computation worked perfectly when I had 4000 tasks (and 500 cores) instead.

mrocklin (Member) commented:

We have significantly reduced the number of open sockets between workers and the scheduler, but there are still at least one or two per worker process, and the scheduler must be able to open at least that many files. On Windows this is tricky, because the select() limit is hard-coded into the Python build. We have increased this hard-coded limit in the conda-forge and conda defaults recipes to something like two thousand. For the moment I think that this is all we can do without significant changes. Closing.
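
As a rough illustration of that hard-coded ceiling (an assumption-laden sketch, not something from this thread): CPython's select() on Windows only accepts up to FD_SETSIZE sockets per call, and the probe below estimates the effective limit of whatever build it runs on by opening plain sockets until select() refuses them.

```python
# Rough probe of the select() socket limit in the current Python build.
# Illustrative only: it opens unconnected sockets and passes them all to
# select(); once the build's limit is exceeded, the same ValueError seen
# in the traceback above is raised (or an OSError on other platforms).
import select
import socket

socks = []
try:
    while len(socks) < 5000:  # safety cap so the probe always terminates
        socks.append(socket.socket())
        select.select(socks, [], [], 0)
except (ValueError, OSError) as exc:
    print("select() refused %d sockets: %s" % (len(socks), exc))
finally:
    for s in socks:
        s.close()
```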
