Bug: Error: EMFILE: too many open files, open at lazyFs.open (internal/fs/streams.js:273:12) #1620
Comments
It looks to me like the upper limit on the number of open files has been reached. Are there other processes running that would open a lot of files (like file sync, or a database...)?
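To check whether a process is really approaching its descriptor limit, one can compare its open-descriptor count with its soft "Max open files" limit. The TypeScript sketch below assumes Linux (the reporter runs Ubuntu 16.04 in Docker), where `/proc/<pid>/fd` has one entry per open descriptor and `/proc/<pid>/limits` lists the process limits. `countOpenFds` and `maxOpenFiles` are hypothetical helper names, not part of NNI.

```typescript
import * as fs from "fs";

// Count a process's open file descriptors by listing /proc/<pid>/fd, which
// holds one entry per open descriptor. With pid "self", the count also
// includes the descriptor readdirSync uses while reading the directory.
function countOpenFds(pid: number | "self" = "self"): number {
  return fs.readdirSync(`/proc/${pid}/fd`).length;
}

// Read the soft "Max open files" limit (the value `ulimit -n` shows) from
// /proc/<pid>/limits; returns Infinity if the limit is "unlimited".
function maxOpenFiles(pid: number | "self" = "self"): number {
  const line = fs
    .readFileSync(`/proc/${pid}/limits`, "utf8")
    .split("\n")
    .find((l) => l.startsWith("Max open files"));
  if (line === undefined) {
    throw new Error(`"Max open files" not found in /proc/${pid}/limits`);
  }
  // Columns after the label are: soft limit, hard limit, units.
  const soft = line.slice("Max open files".length).trim().split(/\s+/)[0];
  return soft === "unlimited" ? Infinity : Number(soft);
}

console.log(`open fds: ${countOpenFds()}, soft limit: ${maxOpenFiles()}`);
```

Sampling `countOpenFds(pid)` for the NNI manager's PID every few hours (reading another process's `/proc` entries requires the same user or root) would show whether the count keeps climbing toward the limit.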
So far every
Also, this warning in stderr is pretty annoying (it takes up MBs of logs):
@ultmaster Thanks for the response! That's true, but the process that causes it is
After 12 hours of running,
Hi @apatsekin. Thanks for raising the issue. Does this problem only occur in remote mode? Does it also happen in local mode?
Hi @SparkSnail. Thanks for responding! The same experiment run in local mode doesn't cause this growth of unclosed handles. So it's something specific to
Bump... any ideas? I'm sure it's a matter of a one-line fix that closes a file handle in some
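A "one-line" fix of this kind usually means making sure every opened descriptor is closed on every code path, including error paths. Below is a minimal synchronous sketch of that pattern in TypeScript (the language of NNI's manager); `withFd` is a hypothetical helper, and nothing here claims to show what the actual leak in NNI looks like.

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

// Hypothetical helper (not NNI's code): open a file, pass its descriptor to
// `fn`, and close it in `finally` so it is released even when `fn` throws.
// Omitting that closeSync() call is the kind of one-line slip that makes a
// long-running process leak descriptors until open() fails with EMFILE.
function withFd<T>(filePath: string, flags: string, fn: (fd: number) => T): T {
  const fd = fs.openSync(filePath, flags);
  try {
    return fn(fd);
  } finally {
    fs.closeSync(fd);
  }
}

// Usage: 5000 open/write cycles never hold more than one descriptor at a
// time, because each one is closed before the next is opened.
const demoFile = path.join(os.tmpdir(), `withfd-demo-${process.pid}.txt`);
for (let i = 0; i < 5000; i++) {
  // "w" truncates the file, so only the last iteration's line remains.
  withFd(demoFile, "w", (fd) => fs.writeSync(fd, `iteration ${i}\n`));
}
const lastLine = withFd(demoFile, "r", (fd) => {
  const buf = Buffer.alloc(64);
  const bytesRead = fs.readSync(fd, buf, 0, buf.length, 0);
  return buf.toString("utf8", 0, bytesRead);
});
fs.unlinkSync(demoFile);
console.log(lastLine.trim());
```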
@SparkSnail It seems to be a remote-mode problem.
Hi @apatsekin, thanks for raising the problem. I've checked NNI's code, and this problem may be caused by https://github.com/microsoft/nni/blob/master/src/nni_manager/training_service/remote_machine/sshClientUtility.ts#L152. I'm investigating and will provide a hotfix ASAP.
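The stack trace in the title points at `lazyFs.open` in `internal/fs/streams.js`, the Node.js module that implements `fs.ReadStream` and `fs.WriteStream`, which suggests the failing open came from a file stream. The code at the linked line of `sshClientUtility.ts` is not reproduced in this thread, so the following is only a generic sketch under the assumption that file streams are involved. `copyWithStreams` is a hypothetical helper that connects two streams with Node's `stream.pipeline()`, which destroys every stream in the chain if any of them fails, while fs streams close their own descriptors on normal completion (the default `autoClose: true`).

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";
import { pipeline } from "stream";

// Hypothetical helper (not NNI's code): copy src to dest through file streams.
// pipeline() destroys both streams if either one errors, and fs streams close
// their descriptors on normal completion, so neither descriptor is left open.
function copyWithStreams(src: string, dest: string): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    pipeline(fs.createReadStream(src), fs.createWriteStream(dest), (err) => {
      if (err) {
        reject(err);
      } else {
        resolve();
      }
    });
  });
}

// Usage: copy a small temporary file and print the copy's content.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "pipeline-demo-"));
const src = path.join(dir, "src.txt");
fs.writeFileSync(src, "hello nni\n");
copyWithStreams(src, path.join(dir, "dest.txt"))
  .then(() => console.log(fs.readFileSync(path.join(dir, "dest.txt"), "utf8")))
  .catch((err) => console.error("copy failed:", err));
```

A stream that is created but never piped, read, or destroyed (for example, one abandoned on an error path) keeps its descriptor open, since `fs.createReadStream()` opens the file as soon as the stream is constructed; routing streams through `pipeline()` is one way to make their cleanup explicit.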
Short summary about the issue/question:
The NNI control server crashes after an experiment has been running for several days. The only signs of the crash are found in `nnictl log stderr [exp_id]`; otherwise, it just looks as if the process disappeared.

Brief what process you are following:
Running `nnictl create` on one machine to create a `remote` experiment that involves 2 workers (2 different machines). After 6 days of running, the NNI control daemon fails with:

How to reproduce it:
NNI environment:
nni version: 1.0
nni mode (local|pai|remote): remote
OS: Ubuntu 16.04.6 LTS
python version: 3.6.8
is conda or virtualenv used?: no
is running in docker?: yes
need to update document(yes/no): no
Anything else we need to know:
Relevant logs:
nnimanager.log
nni_stderr.log
dispatcher.log