Skip to content
This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Bug: Error: EMFILE: too many open files, open at lazyFs.open (internal/fs/streams.js:273:12) #1620

Closed
apatsekin opened this issue Oct 17, 2019 · 9 comments
Assignees
Milestone

Comments

@apatsekin
Copy link
Contributor

apatsekin commented Oct 17, 2019

Short summary about the issue/question:
NNI control server crashes while running experiment for several days. The only signs of crash are found in nnictl log stderr [exp_id]. Everything else just looks like process disappeared.

Brief what process you are following: Running a nnicreate on one machine to create remote experiment which involves 2 workers (2 different machines). After 6 days of running nni control daemon fails with:

(node:41922) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 10569)
events.js:167
      throw er; // Unhandled 'error' event
      ^

Error: EMFILE: too many open files, open '/root/nni/experiments/cJd6Fzn0/log/nnimanager.log'
Emitted 'error' event at:
    at lazyFs.open (internal/fs/streams.js:273:12)
    at FSReqWrap.oncomplete (fs.js:141:20)

How to reproduce it:

nni Environment:
nni version: 1.0
nni mode(local|pai|remote): remote
OS: Ubuntu 16.04.6 LTS
python version: 3.6.8
is conda or virtualenv used?: no
is running in docker?: yes

need to update document(yes/no): no

Anything else we need to know:
Relevant logs:
nnimanager.log
nni_stderr.log
dispatcher.log

@suiguoxin suiguoxin added the bug Something isn't working label Oct 18, 2019
@ultmaster ultmaster removed the bug Something isn't working label Oct 20, 2019
@ultmaster
Copy link
Contributor

It looks to me that the upper limit of number of files opened is reached. Is there other processes running that would open a lot of files (like file sync, or database...).

@apatsekin
Copy link
Contributor Author

apatsekin commented Oct 23, 2019

So far every remote experiment eventually errors out with this message. After resuming this experiment it had been working for 5 days until next crash. Same error message.

events.js:167
      throw er; // Unhandled 'error' event
      ^

Error: EMFILE: too many open files, open '/root/nni/experiments/cJd6Fzn0/log/nnimanager.log'
Emitted 'error' event at:
    at lazyFs.open (internal/fs/streams.js:273:12)
    at FSReqWrap.oncomplete (fs.js:141:20)

also this warning in stderr is pretty annoying (takes MBs of logs):

(node:48255) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
    at JSON.parse (<anonymous>)
    at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/root/.pyenv/versions/3.6.8/nni/training_service/remote_machine/remoteMachineTrainingService.js:331:42)
    at process._tickCallback (internal/process/next_tick.js:68:7)

@ultmaster thanks for response! That's true, but the process which causes it is node, which is exclusively used by NNI. Check out below how number of opened file handlers grows within an hour after restarting. Please consider returning bug label to this thread.

root@gpu2-ru7 /h/a/p/simpledet# lsof > /tmp/lsof.log
root@gpu2-ru7 /h/a/p/simpledet# cat /tmp/lsof.log | awk '{ print $2 " " $1; }' | sort -rn | uniq -c | sort -rn | head -20
   2024 6244 node
    264 6267 python3
     38 29462 sshd
     29 5467 zsh
     29 1620 zsh
     29 1 zsh
     17 7982 lsof
     12 7983 lsof
     11 6266 sh
      1 PID COMMAND
root@gpu2-ru7 /h/a/p/simpledet# lsof > /tmp/lsof.log
root@gpu2-ru7 /h/a/p/simpledet# cat /tmp/lsof.log | awk '{ print $2 " " $1; }' | sort -rn | uniq -c | sort -rn | head -20
   3520 6244 node
    264 6267 python3
     38 29462 sshd
     29 5467 zsh
     29 1620 zsh
     29 1 zsh
     17 8119 lsof
     12 8120 lsof
     11 6266 sh
      1 PID COMMAND
root@gpu2-ru7 /h/a/p/simpledet# lsof > /tmp/lsof.log
root@gpu2-ru7 /h/a/p/simpledet# cat /tmp/lsof.log | awk '{ print $2 " " $1; }' | sort -rn | uniq -c | sort -rn | head -20
  21527 6244 node
    264 6267 python3
     38 29462 sshd
     29 5467 zsh
     29 1620 zsh
     29 1 zsh
     17 8249 lsof
     12 8250 lsof
     11 6266 sh
      1 PID COMMAND

@apatsekin
Copy link
Contributor Author

After 12 hours of run, node keeps 1.5m opened file handlers to log/nnimanager.log.

@SparkSnail
Copy link
Contributor

Hi, @apatsekin . Thanks for raising the issue, does this problem only occur on remote mode? will it happen on local mode?

@apatsekin
Copy link
Contributor Author

Hi @SparkSnail . Thanks for reacting! The same experiment run in a local mode doesn't cause this growth of unclosed handlers. So it's something specific to remote mode.

@apatsekin
Copy link
Contributor Author

apatsekin commented Nov 13, 2019

bump... any ideas? I'm sure it's a matter of one-line closing file handler in some final block in logging module.

**138492** 5883 node
   264 5905 python3
    38 3300 sshd
    29 6363 zsh
    29 4860 zsh
    29 3753 zsh
    29 1 zsh
    17 6525 lsof
    12 6526 lsof
    11 5904 sh
     1 PID COMMAND

@scarlett2018
Copy link
Member

@SparkSnail seems a remote mode problem.

@SparkSnail
Copy link
Contributor

Hi @apatsekin , thanks for raising the problem, I've checked NNI's code, and perhaps this problem is caused by https://github.com/microsoft/nni/blob/master/src/nni_manager/training_service/remote_machine/sshClientUtility.ts#L152. I'm investigating and will give a hot fix ASAP.

@scarlett2018 scarlett2018 added the bug Something isn't working label Nov 14, 2019
@scarlett2018 scarlett2018 added this to the Backlog milestone Nov 14, 2019
@SparkSnail
Copy link
Contributor

#1755

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

7 participants