
Graceful reload doesn't wait for new process to launch before killing old one #3078

ExplodingCabbage opened this issue Aug 9, 2017 · 25 comments

@ExplodingCabbage

What's going wrong?

Graceful reload basically doesn't work at all, as far as I can tell. Am I doing something wrong?

How could we reproduce this issue?

Tail your pm2 logs using pm2 logs, then run the following test file with pm2 start test.js --kill-timeout 20000 --listen-timeout 20000 --wait-ready:

console.log(new Date(), 'New process launched');
var http = require('http');
var app = http.createServer((req, res) => {
    // Simple request handler that waits for a long time and then emits a hard-coded string.
    setTimeout(() => {
        res.writeHead(200);
        res.end('MOOOOOO');
    }, 10000);
});

process.on('SIGINT', function() {
    console.log(new Date(), 'Got SIGINT');
    // Stop accepting new connections, wait for all existing connections to be handled, and then exit.
    app.close(() => {
        console.log(new Date(), 'exiting');
        process.exit();
    });
});

process.on('message', function(msg) {
    // Never gets called
    console.log(new Date(), 'msg:', msg);
});

app.listen(1234, function() {
    console.log(new Date(), 'Listening on port 1234');

    // Tell pm2 that we're ready to replace the old process
    process.send('ready');
});

Then:

  • optionally run curl localhost:1234 (not needed to see that something is wrong)
  • run pm2 gracefulReload all
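
Putting the steps above together, the full repro sequence looks roughly like this (all commands and flags are taken from the report above; the curl step is optional):

pm2 start test.js --kill-timeout 20000 --listen-timeout 20000 --wait-ready
pm2 logs                  # tail the logs (run in a separate terminal)
curl localhost:1234       # optional
pm2 gracefulReload all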

Expected behaviour:

The following things happen in order:

  1. a new process spawns and starts listening for connections
  2. the old process receives SIGINT, stops listening for connections, finishes handling any requests it's already handling, then exits

There should be zero downtime with the flow described above.

Observed behaviour:

The new process doesn't even launch until the old process is completely terminated, as shown by these logs that are printed when I do pm2 gracefulReload all:

PM2        | Process 0 in a stopped status, starting it
PM2        | Stopping app:test id:0
0|test     | 2017-08-09T13:27:22.385Z 'Got SIGINT'
PM2        | pid=66127 msg=failed to kill - retrying in 100ms
[... the line above repeats every 100ms for roughly the next 5 seconds ...]
0|test     | 2017-08-09T13:27:27.742Z 'exiting'
PM2        | App [test] with id [0] and pid [66127], exited with code [0] via signal [SIGINT]
PM2        | pid=66127 msg=process killed
PM2        | Starting execution sequence in -fork mode- for app name:test id:0
0|test     | 2017-08-09T13:27:28.050Z 'New process launched'
0|test     | 2017-08-09T13:27:28.083Z 'Listening on port 1234'
PM2        | App name:test id:0 online

Supporting information

The docs at http://pm2.keymetrics.io/docs/usage/pm2-doc-single-page/ seem to back up my expectations of what the behaviour should be; they state:

When PM2 starts a new process to replace an old one, it will wait for the new process to begin listening to a connection or a timeout before sending the shutdown message to the old one

This is very clearly not what's happening here! Am I doing something wrong, or misunderstanding the docs, or is this feature fundamentally broken? I'm pretty confused!

@vmarchaud
Contributor

gracefulReload is deprecated; could you try with pm2 reload?

@ExplodingCabbage
Author

ExplodingCabbage commented Aug 21, 2017

gracefulReload is deprecated

Hmm, you've got a documentation bug at http://pm2.keymetrics.io/docs/usage/pm2-doc-single-page/, then. It's documented there with no indication that it's deprecated.

could you try with pm2 reload?

I'm pretty sure I did this and saw the same behaviour. However, after work tonight I will recreate my test case and try with reload just so I can tell you that with 100% certainty.

@vmarchaud
Contributor

Indeed, I took the time to remove it this time. Tell me if it's also happening with reload.

@ExplodingCabbage
Author

Sorry for the delay, but I have just confirmed the same behaviour with pm2 reload all. Here's the log output from after I run pm2 reload all:

PM2        | Process 0 in a stopped status, starting it
PM2        | Stopping app:test id:0
0|test     | 2017-09-03T20:27:20.950Z 'Got SIGINT'
PM2        | pid=3836 msg=failed to kill - retrying in 100ms
[... the line above repeats every 100ms for roughly the next 9 seconds ...]
0|test     | 2017-09-03T20:27:29.819Z 'exiting'
PM2        | App [test] with id [0] and pid [3836], exited with code [0] via signal [SIGINT]
PM2        | pid=3836 msg=process killed
PM2        | Starting execution sequence in -fork mode- for app name:test id:0
0|test     | 2017-09-03T20:27:30.041Z 'New process launched'
0|test     | 2017-09-03T20:27:30.058Z 'Listening on port 1234'
PM2        | App name:test id:0 online

@brendonboshell

@ExplodingCabbage This could be related to an issue I filed, #3143. Can you try reverting commit d0a3f49 and see if you observe the same behavior? I have been unable to get zero-downtime reloads since 2.1.1. It could be a different issue, though, as I have only tested with cluster mode.

@ExplodingCabbage
Author

Just FYI @brendonboshell, I'm not a contributor or even a user and have never built pm2 - so nah, I'm not inclined to figure out how to run a custom build. Should be trivial for a contributor to follow my instructions above to repro though.

@azazdeaz

azazdeaz commented Nov 9, 2017

I can easily reproduce this by delaying the listen call in this example:

// testpm2reload.js
var http = require("http"),
    app = require("express")();

app.use("/", function (req, res) {
  return res.send('hello');
});

var server = http.createServer(app);
setTimeout(function () {
  server.listen(4000, function () {
    process.send('ready');
  });
}, 5000)

process.on('SIGINT', function() {
   server.close(function(err) {
     process.exit(err ? 1 : 0);
   });
});

  1. start the server: pm2 start testpm2reload.js
  2. reload it: pm2 reload testpm2reload --wait-ready --listen-timeout 30000
  3. observe that the site is unresponsive for several seconds

If I see it right, pm2 first terminates the currently running process and only then starts the new one, so I don't understand how it is possible to get zero downtime this way (like with naught).
Or is it not possible to have zero downtime without cluster mode?

Using pm2 v2.7.2.

@brendonboshell

@azazdeaz My understanding (although I could be wrong) is that --wait-ready should be passed to pm2 start, not pm2 reload. Do you experience the same issue in that case?

@azazdeaz

azazdeaz commented Nov 9, 2017

Yes, it's the same if I use --wait-ready for both start and reload.

@guyellis

@vmarchaud

gracefulReload is deprecated, could you try with pm2 reload ?

I commented on another issue and mentioned this to @Unitech. The deprecation warning for gracefulReload was commented out, as if the decision to remove it had been reversed.

Can you clarify its state?

@SharksevenRo

This problem bothered me for a long time. I solved it by setting listen_timeout to something bigger than the default (3s). pm2 listens for the 'ready' signal for only 3 seconds after you execute the pm2 command, so if your app takes more than 3 seconds to start, add the parameter --listen-timeout=10000.
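
For example, a minimal sketch of that workaround (the 10-second value is arbitrary and test.js is just the test script from earlier in this thread):

pm2 start test.js --wait-ready --listen-timeout 10000

The equivalent ecosystem.config.js keys are wait_ready and listen_timeout.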

@fullofcaffeine

I'm having the same issue. reload doesn't seem to work at all.

@endel

endel commented Feb 3, 2020

Hi there, this issue was reported in 2017; are there any blockers to solving it? I believe the expected behaviour described by @ExplodingCabbage is very reasonable. I'm also testing the reload command right now, and from my perspective that's the behaviour that should happen to avoid downtime.

@stale

stale bot commented May 23, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label May 23, 2020
@ExplodingCabbage
Author

Dear stale bot: this is presumably not fixed, and was a serious enough bug that it totally defeated the entire reason I wanted to use this tool 3 years ago; it's pretty egregious that one of the primary advertised features essentially just doesn't work at all. Until this gets fixed, it should remain open.

@stale stale bot removed the stale label May 23, 2020
@DrakeAnglin

I am also having the same issue. Any time PM2 restarts the app, whether by running the pm2 kill, restart, or start commands or when it restarts on its own due to watch, it gets the "failed to kill" error. In order to clear the app's error state, I need to go to Task Manager, kill all running node processes, and then run the pm2 start command again.

Need this to get fixed soon; I'm trying to push this to production but it can't go out like this.

[screenshot omitted]

@stale

stale bot commented Jul 18, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jul 18, 2020
@ExplodingCabbage
Author

Go away, stale bot.

@stale stale bot removed the stale label Jul 18, 2020
@amit777

amit777 commented Sep 19, 2020

I have this problem as well. I have tried a lot of things but cannot get my Apache ab tests to run cleanly during a reload.

@amit777

amit777 commented Sep 19, 2020

Actually, could someone maybe provide an example where the expected behavior works?

@TimTheToaster

I finally got this working! The fact that reload only performs as expected in cluster mode should be much more prominently indicated. I got it working by adding this to my ecosystem.config.js file (under apps):

    exec_mode: 'cluster',
    instances: 1,
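
For anyone else landing here, a minimal ecosystem.config.js sketch that combines the cluster-mode settings above with the readiness options discussed earlier in the thread might look like the following (the app name, script path, and timeout values are illustrative, not an official recommendation):

// ecosystem.config.js (illustrative sketch only)
module.exports = {
  apps: [{
    name: 'test',            // hypothetical app name
    script: './test.js',
    exec_mode: 'cluster',    // reload only swaps processes gracefully in cluster mode
    instances: 1,
    wait_ready: true,        // wait for process.send('ready') before replacing the old process
    listen_timeout: 10000,   // ms to wait for 'ready' before giving up
    kill_timeout: 20000      // ms to wait after SIGINT before force-killing the old process
  }]
};

Reloading then goes through pm2 reload test (or pm2 reload ecosystem.config.js).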

@TimTheToaster

This might help someone

@owenallenaz

I am having this exact issue. We are running the application with exec_mode: "cluster", and about 1 in every 1000 reloads it gets stuck like this:

2021-10-05T22:20:35: PM2 log: pid=18349 msg=failed to kill - retrying in 100ms
[... the same message repeats every ~100ms, with timestamps running from 22:20:35 through 22:20:39 and onwards ...]

The new version of the process starts; the old one is just stuck like this and I don't know why.

@shikhartechxle

Hi from 2023; I'm suddenly facing the same thing with the latest pm2 version.
Can someone help here?

@callmeteus

Still facing this issue
